I'm new to Python.
I have a DataFrame like this:

       1  2  3
75016  1  2  2
75017  0  0  0
75018  0  2  2
For each row, I want to identify which columns have the value 1 or 2.
The desired output:
"75016" has columns 1,2,3
"75017" has no columns
"75018" has columns 2,3
How can I do it?
Thank you.
Assuming a pandas DataFrame, you can use:
s = df.stack().isin([1, 2])           # True for cells equal to 1 or 2
out = (s[s]                           # keep only the matching cells
       .reset_index(1)                # turn the column labels into a column
       .groupby(level=0)['level_1']   # group the labels by original row
       .agg(','.join)                 # join them into one string per row
       .reindex(df.index).fillna('')  # restore rows with no match as ''
)
output:
75016 1,2,3
75017
75018 2,3
Name: level_1, dtype: object
used input:
import pandas as pd
df = pd.DataFrame({'1': [1, 0, 0], '2': [2, 0, 2], '3': [2, 0, 2]},
                  index=[75016, 75017, 75018])
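An alternative sketch for the same task, assuming string column labels as here, builds a boolean mask and joins the matching labels with DataFrame.dot:

mask = df.isin([1, 2])
# the matrix product of a boolean mask and the labels concatenates
# the labels of the True cells; strip the trailing separator
out = mask.dot(df.columns + ',').str.rstrip(',')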
Reproducible DataFrame:
import pandas as pd
data = {'refid': ['1.2.34',
                  '1.2.35',
                  '1.3.66',
                  '1.6.99',
                  '1.9.00',
                  '1.87.66',
                  '1.98.00',
                  '1.100.1',
                  '1.101.3'],
        }
my_index = pd.MultiIndex.from_arrays([["A"]*6 + ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)
I want a new column that combines ID-B with refid truncated at its second delimiter.
For example, for ID-B 1 and refid 1.2.34, the secondary refid should be 1.2 and the unique ID should be 1_1.2.
You can use get_level_values with str.extract and concatenate the values converted to string:
df['new'] = (df.index.get_level_values('ID-B').astype(str) + '_'
             + df['refid'].str.extract(r'(\d+\.\d+)', expand=False)
             )
output:
             refid      new
ID-A ID-B
A    1      1.2.34    1_1.2
     1      1.2.35    1_1.2
     1      1.3.66    1_1.3
     2      1.6.99    2_1.6
     2      1.9.00    2_1.9
     2     1.87.66   2_1.87
B    1     1.98.00   1_1.98
     1     1.100.1  1_1.100
     1     1.101.3  1_1.101
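If you prefer to avoid regex, an equivalent sketch splits off the last dotted component with str.rsplit:

# keep everything before the last '.', e.g. '1.2.34' -> '1.2'
secondary = df['refid'].str.rsplit('.', n=1).str[0]
df['new'] = df.index.get_level_values('ID-B').astype(str) + '_' + secondary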
My current DataFrame looks like the following:
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
How can I now create a new DataFrame that is a SUMIF-style calculation on the above columns? So far I have the below:
df2 = pd.DataFrame()
df2['CountIfWings'] = (df.num_wings == '2').sum()
But that does not work; however, when I just assign it to a variable, I am able to see the value:
variable1 = (df.num_wings == '2').sum()
print(variable1)
1
Can anyone assist?
IIUC, use the DataFrame constructor, because the output is a scalar:
df2 = pd.DataFrame({'CountIfWings': [(df.num_wings == '2').sum()]})
# if the values are integers, compare to 2 instead
df2 = pd.DataFrame({'CountIfWings': [(df.num_wings == 2).sum()]})
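The same pattern extends to several COUNTIF-style columns at once; a sketch with a hypothetical second count on num_legs:

df2 = pd.DataFrame({'CountIfWings': [(df.num_wings == 2).sum()],
                    'CountIfLegs4': [(df.num_legs == 4).sum()]})  # hypothetical extra condition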
[DataFrame image: time-indexed rows with columns 1, 2 and 3]
The operation that I intend to perform: whenever there is a 2 in column 3, take that entry's column 1 value, subtract the column 1 value of the previous entry, and multiply the result by a constant integer (say 5).
For example: in the image we have a 2 in column 3 at 6:00, where the column 1 value is 0.011333; take the previous column 1 entry, which is 0.008583, and compute:
(0.011333 - 0.008583) * 5
I want to perform this every time a 2 appears in column 3 of the DataFrame. Please help; I am not able to get the right code for the above operation.
Hope this helps:
You can use df.shift(1) to get the previous row and np.where to fill only the rows satisfying your condition:
import numpy as np
import pandas as pd

df = pd.DataFrame([['ABC', 1, 0, 0],
                   ['DEF', 2, 0, 0],
                   ['GHI', 3, 0, 0],
                   ['JKL', 4, 0, 2],
                   ['MNO', 5, 0, 2],
                   ['PQR', 6, 0, 2],
                   ['STU', 7, 0, 0]],
                  columns=['Date & Time', 'column 1', 'column 2', 'column 3'])
# (current - previous) * 5 wherever column 3 == 2, else 0
df['new'] = np.where(df['column 3'] == 2,
                     (df['column 1'] - df['column 1'].shift(1)) * 5,
                     0)
print(df)
Output:
  Date & Time  column 1  column 2  column 3  new
0         ABC         1         0         0  0.0
1         DEF         2         0         0  0.0
2         GHI         3         0         0  0.0
3         JKL         4         0         2  5.0
4         MNO         5         0         2  5.0
5         PQR         6         0         2  5.0
6         STU         7         0         0  0.0
You can change the calculation as you want; in the else part you can put np.nan or any other expression.
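For example, to leave the non-matching rows empty instead of 0:

# same df as above; NaN marks rows where column 3 != 2
df['new'] = np.where(df['column 3'] == 2,
                     (df['column 1'] - df['column 1'].shift(1)) * 5,
                     np.nan)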
Would something like that do the job?
dataframe = [
    [1, 3, 6, 6, 7],
    [4, 3, 5, 6, 7],
    [12, 3, 2, 6, 7],
    [2, 3, 7, 6, 7],
    [9, 3, 5, 6, 7],
    [13, 3, 2, 6, 7]
]
constant = 5
list_of_outputs = []
for row in dataframe:
    if row[2] == 2:
        try:
            output = (row[0] - prev_entry) * constant
            list_of_outputs.append(output)
        except NameError:  # prev_entry not set yet
            print("No previous entry!")
    prev_entry = row[0]
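Running it on the sample rows above, the two qualifying rows (first values 12 and 13, with previous entries 4 and 9) give:

print(list_of_outputs)  # [40, 20]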
Perhaps this question will help you
I think in SQL terms, so basically you first make a new column filled with the value from the row above it:
df['column1_lagged'] = df['column 1'].shift(1)
Then you create another column that does the calculation:
constant = 5
df['calculation'] = (df['column 1'] - df['column1_lagged'])*constant
After that you just slice the DataFrame with your condition (rows where column 3 equals 2):
condition = df['column 3'] == 2
df[condition]
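The three steps can also be chained into a single expression; a sketch assuming pandas >= 0.25 for the backticked column name in query:

constant = 5
result = (df.assign(column1_lagged=df['column 1'].shift(1))
            .assign(calculation=lambda d: (d['column 1'] - d['column1_lagged']) * constant)
            .query('`column 3` == 2'))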
I have a data frame with products as rows and their characteristics as columns.
For every unique value in every characteristic column, I would like to create a new dummy variable, which will be 1 if this specific characteristic value exists for that specific product and 0 otherwise.
As an example:
import pandas as pd
df = pd.DataFrame({'id': ['prod_A', 'prod_A', 'prod_B', 'prod_B'],
                   'color': ['red', 'green', 'red', 'black'],
                   'size': [1, 2, 3, 4]})
and I would like to end up with a data frame like this:
df_f = pd.DataFrame({'id': ['prod_A', 'prod_B'],
                     'color_red': [1, 1],
                     'color_green': [1, 0],
                     'color_black': [0, 1],
                     'size_1': [1, 0],
                     'size_2': [1, 0],
                     'size_3': [0, 1],
                     'size_4': [0, 1]})
Any ideas?
Use get_dummies with aggregate max:
# dummies for all columns except `id`
df = pd.get_dummies(df.set_index('id')).groupby(level=0).max().reset_index()
# or: dummies only for the columns in a list
df = pd.get_dummies(df, columns=['color', 'size']).groupby('id', as_index=False).max()
print (df)
       id  color_black  color_green  color_red  size_1  size_2  size_3  size_4
0  prod_A            0            1          1       1       1       0       0
1  prod_B            1            0          1       0       0       1       1
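An alternative sketch builds the indicators per characteristic column with pd.crosstab; gt(0) collapses any duplicated id/value pairs to a 0/1 flag:

color = pd.crosstab(df['id'], df['color']).add_prefix('color_')
size = pd.crosstab(df['id'], df['size']).add_prefix('size_')
out = color.join(size).gt(0).astype(int).reset_index()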
So I got this DataFrame, built so that for id equal to 2 there are two different values in the num and my_date columns:
import pandas as pd
from datetime import datetime

a = pd.DataFrame({'id': [1, 2, 3, 2],
                  'my_date': [datetime(2017, 1, i) for i in range(1, 4)] + [datetime(2017, 1, 1)],
                  'num': [2, 3, 1, 4]
                  })
For convenience, this is the DataFrame in a readable form:

   id    my_date  num
0   1 2017-01-01    2
1   2 2017-01-02    3
2   3 2017-01-03    1
3   2 2017-01-01    4
If I want to count the number of unique values for each id, I'd do
grouped_a = a.groupby('id').agg({'my_date': pd.Series.nunique,
                                 'num': pd.Series.nunique}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
which gives a weird result: the unique counts for the datetime column come out wrong.
It looks like counting unique values on a datetime column (which pandas stores as datetime64[ns]) is not working?
It is a bug, see GitHub issue 14423.
But you can use SeriesGroupBy.nunique, which works nicely:
grouped_a = a.groupby('id').agg({'my_date': 'nunique',
                                 'num': 'nunique'}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print (grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
If the DataFrame has only these 3 columns, you can use:
grouped_a = a.groupby('id').agg(['nunique']).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print (grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
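On pandas >= 0.25 the renaming step can be dropped entirely by using named aggregation; a sketch:

grouped_a = (a.groupby('id')
              .agg(num_unique_my_date=('my_date', 'nunique'),
                   num_unique_num=('num', 'nunique'))
              .reset_index())
print(grouped_a)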