I have a pandas DataFrame.
Sample DataFrame:
df = a1 a2 a3 a4 a5
      0  1  1  1  0   # dict['a3_a4'] = 1, dict['a2_a4'] = 1, dict['a2_a3'] = 1
      1  1  1  0  0   # dict['a1_a2'] = 1, dict['a1_a3'] = 1, dict['a2_a3'] = 1
I need a function that takes the DataFrame as input, counts how many times each pair of columns is 1 together, and stores the counts in a dictionary.
So my output dict will look like this: {'a1_a2': 1, 'a2_a3': 2, 'a3_a4': 1, 'a1_a3': 1, 'a2_a4': 1}
Pseudocode is fine if needed.
PS: I am new to Stack Overflow, so forgive me for my mistakes.
You can use itertools.combinations to get all pairs of columns, multiply each pair of columns element-wise, and take the sum of each product.
from itertools import combinations
import pandas as pd

# all unordered pairs of column names
cc = list(combinations(df.columns, 2))
# element-wise product: 1 only where both columns are 1, so summing counts the co-occurrences
df1 = pd.concat([df[c[1]] * df[c[0]] for c in cc], axis=1, keys=cc)
df1.columns = df1.columns.map('_'.join)
d = df1.sum().to_dict()
print(d)
Output:
{'a1_a2': 1,
'a1_a3': 1,
'a1_a4': 0,
'a1_a5': 0,
'a2_a3': 2,
'a2_a4': 1,
'a2_a5': 0,
'a3_a4': 1,
'a3_a5': 0,
'a4_a5': 0}
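If you only want the pairs that actually appear together (as in the expected output above), you can drop the zero counts afterwards; a small sketch building on the dictionary d from above:
# keep only column pairs that co-occur at least once
d_nonzero = {pair: count for pair, count in d.items() if count > 0}
print(d_nonzero)
# {'a1_a2': 1, 'a1_a3': 1, 'a2_a3': 2, 'a2_a4': 1, 'a3_a4': 1}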
I would like to add multiple columns programmatically to a dataframe using pre-defined rules. As an example, I would like to add 3 columns to the below dataframe, based on whether or not they satisfy the three rules indicated in the code below:
#define dataframe
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})
#define "rules" for adding subsequent columns
rule_1 = (df1["time1"] == 1)
rule_2 = (df1["time2"] == 1)
rule_3 = (df1["time3"] == 1)
#add new columns based on whether or not above rules are satisfied
df1["rule_1"] = np.where(rule_1, 1, 0)
df1["rule_2"] = np.where(rule_2, 1, 0)
df1["rule_3"] = np.where(rule_3, 1, 0)
As you can see, my approach gets tedious when I need to add tens of columns - each based on a different "rule" - to a test dataframe.
Is there a way to do this more easily, without defining each column manually along with its individual np.where clause? I tried doing something like the following, but pandas does not accept it:
rules = [rule_1, rule_2, rule_3]
for rule in rules:
    df1[rule] = np.where(rule, 1, 0)
Any ideas on how to make my approach more programmatically efficient?
The solution you provided doesn't work because you are using the rule itself (a boolean Series) as the new column name for that rule. I would solve it like this:
rules = [rule_1, rule_2, rule_3]
for i, rule in enumerate(rules):
    df1[f'rule_{i+1}'] = np.where(rule, 1, 0)
Leverage Python's f-strings in a for loop; they are well suited to this:
# Create a list by filtering the time columns
cols = list(df1.filter(regex='time', axis=1).columns)

# Iterate through the list of columns, imposing the condition with np.where
for col in cols:
    df1[f'{col}_new'] = np.where(df1[col] == 1, 1, 0)
I might be oversimplifying your rules, but something like:
rules = [
    ('time1', 1),
    ('time2', 1),
    ('time3', 1),
]
for i, (col, val) in enumerate(rules):
    df1[f"rule_{i + 1}"] = np.where(df1[col] == val, 1, 0)
If all of your rules check the same thing, maybe this could be helpful: unstack the relevant columns, check the condition on the resulting Series, and convert back to a DataFrame with another unstack:
df1[['rule1','rule2','rule3']] = df1[['time1','time2','time3']].unstack().eq(1).astype(int).swaplevel().unstack()
Output:
time1 time2 time3 outcome rule1 rule2 rule3
0 0 1 0 1 0 1 0
1 1 0 0 0 1 0 0
2 1 0 0 0 1 0 0
3 0 0 1 1 0 0 1
4 0 1 0 0 0 1 0
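More generally, if the rules do not all check the same condition, another option (a sketch, assuming every rule can be written as a named boolean Series) is to collect them in a dict and add all the columns in a single assign call:
# hypothetical dict mapping each new column name to a boolean rule
rules = {
    "rule_1": df1["time1"] == 1,
    "rule_2": df1["time2"] == 1,
    "rule_3": df1["time3"] == 1,
}

# add every rule column at once, converting True/False to 1/0
df1 = df1.assign(**{name: cond.astype(int) for name, cond in rules.items()})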
I have the following function for getting the column name of the last non-zero value in a row:
import pandas as pd

def myfunc(X, Y):
    df = X.iloc[Y]
    counter = len(df) - 1
    while counter >= 0:
        if df[counter] == 0:
            counter -= 1
        else:
            break
    return X.columns[counter]
Using the following code example:
data = {'id': ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
df
myfunc(df, 5) # 'B1'
I would like to know how I can apply this function to all rows in a DataFrame and put the results into a new column of df.
I am thinking about looping across all rows (which is probably not a good approach) or using a lambda with the apply function. However, I have not succeeded with this last approach. Any help?
I've modified your function a little bit to work across rows:
def myfunc(row):
    counter = len(row) - 1
    while counter >= 0:
        # positional access, so integer positions work on the labelled row
        if row.iloc[counter] == 0:
            counter -= 1
        else:
            break
    return row.index[counter]
Now just call df.apply with your function and axis=1 so the function is called for each row of the dataframe:
>>> df.apply(myfunc, axis=1)
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
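Since the question asks to put the results into a new column of df, the output of apply can be assigned directly; the column name last_nonzero below is just an illustrative choice:
# store the column name of the last non-zero value for each row
df['last_nonzero'] = df.apply(myfunc, axis=1)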
However, you can ditch your custom function and do what you're looking for in a much faster and more concise way: after a cumulative sum down each transposed row, the running total first reaches its maximum at the last non-zero column, which is exactly what idxmax returns:
>>> df[df.columns[2:]].T.cumsum().idxmax()
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
Here is a simpler and faster solution using DataFrame.idxmax.
>>> res = df.iloc[:, :1:-1].idxmax(axis=1)
>>> res
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
The idea is to select only the value columns (everything after id and name) in reversed order (df.iloc[:, :1:-1]) and then return the column label of the first occurrence of the maximum (1 in this case) for each row (.idxmax(axis=1)).
Note that this solution (like the other answer) assumes that each row contains at least one entry higher than zero.
This assumption can be relaxed to 'each row contains at least one non-zero entry' if we first mask the non-zero entries (using .ne(0)). This works because .ne(0) produces a boolean mask and True > False <=> 1 > 0.
>>> res = df.iloc[:, :1:-1].ne(0).idxmax(axis=1)
>>> res
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
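If rows consisting entirely of zeros are possible, one way to avoid a misleading label (a sketch building on the code above) is to mask those rows out so they end up as NaN:
# same reversed column selection as above
vals = df.iloc[:, :1:-1]

# rows with at least one non-zero entry
has_nonzero = vals.ne(0).any(axis=1)

# column label of the last non-zero value, NaN for all-zero rows
res = vals.ne(0).idxmax(axis=1).where(has_nonzero)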
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every value of 1 with an increasing number, so that each 1 gets its own ID. I'm looking for something like:
df.replace(1, value=3)
which works great, except that instead of the fixed number 3 I need a number that keeps increasing, something like:
number += 1
If I try to combine the two, it doesn't work (or at least I'm not able to find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specifying column or row names, because the table has 2600 columns and 5000 rows.
Element-wise assignment on a copy of df.values can work.
More specifically, a range from 1 to the number of 1's (inclusive) is assigned to the locations of the 1 elements in the value array, and the modified array is then written back into the original DataFrame.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)

for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
A B
2020-01-01 0 1
2020-02-01 2 3
2020-03-01 0 4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)

for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
A B
2020-01-01 0 2
2020-02-01 1 3
2020-03-01 0 4
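As a side note, the column-by-column write-back can be avoided: a cumulative sum over the flattened mask produces the same IDs, and the whole array can be written back at once. This is only a sketch along the same lines, starting again from the original 0/1 DataFrame; ravel/reshape with order='C' gives row-first numbering, order='F' in both calls gives column-first:
import numpy as np

mask = df.values > 0

# row-first numbering (use order='F' in both calls for column-first)
ids = mask.ravel(order='C').cumsum().reshape(df.shape, order='C')

df[:] = np.where(mask, ids, 0)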
So I have this DataFrame, built so that for id equal to 2 there are two different values in both the num and my_date columns:
import pandas as pd
from datetime import datetime

a = pd.DataFrame({'id': [1, 2, 3, 2],
                  'my_date': [datetime(2017, 1, i) for i in range(1, 4)] + [datetime(2017, 1, 1)],
                  'num': [2, 3, 1, 4]})
For convenience, this is the DataFrame printed out:
   id    my_date  num
0   1 2017-01-01    2
1   2 2017-01-02    3
2   3 2017-01-03    1
3   2 2017-01-01    4
If I want to count the number of unique values for each id, I'd do
grouped_a = a.groupby('id').agg({'my_date': pd.Series.nunique,
                                 'num': pd.Series.nunique}).reset_index()
grouped_a.columns = ['id', 'num_unique_num', 'num_unique_my_date']
which gives an unexpected result: the unique count for the datetime column is wrong.
It looks like counting unique values on the datetime column (which pandas converts to datetime64[ns]) is not working?
It is a bug, see GitHub issue 14423.
But you can use SeriesGroupBy.nunique, which works nicely:
grouped_a = a.groupby('id').agg({'my_date': 'nunique',
                                 'num': 'nunique'}).reset_index()
grouped_a.columns = ['id', 'num_unique_num', 'num_unique_my_date']
print (grouped_a)
id num_unique_num num_unique_my_date
0 1 1 1
1 2 2 2
2 3 1 1
If the DataFrame has only these 3 columns, you can use:
grouped_a = a.groupby('id').agg(['nunique']).reset_index()
grouped_a.columns = ['id', 'num_unique_num', 'num_unique_my_date']
print (grouped_a)
id num_unique_num num_unique_my_date
0 1 1 1
1 2 2 2
2 3 1 1
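On newer pandas versions (0.25 and later), named aggregation avoids the separate column-renaming step entirely; a sketch of the same computation:
grouped_a = (a.groupby('id')
              .agg(num_unique_my_date=('my_date', 'nunique'),
                   num_unique_num=('num', 'nunique'))
              .reset_index())
print(grouped_a)
#    id  num_unique_my_date  num_unique_num
# 0   1                   1               1
# 1   2                   2               2
# 2   3                   1               1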
I'm trying to reshape my data. Basically I'm trying to compress 5 rows of data, each with 1 ID and 2 values, into 1 row of data with 1 ID and 10 values. My data is approx. 6 million rows long. One thing to note: not every group has 5 (X, Y) coordinate values; some only have 4.
I could not figure out how to do this by indexing alone, so I wrote a for loop, which doesn't work very well. It will sort the first 10,000 rows or so (but then ends with an error), and it takes forever.
import numpy as np
import pandas as pd

coords = pd.read_csv('IDQQCoords.csv')
coords = coords.as_matrix(columns=None)

# create an empty array the same length as coords
mpty = np.zeros((len(coords), 8), dtype=float)

# add the 8 empty columns from the previous command
# to make space for the values from subsequent rows
coords = np.append(coords, mpty, axis=1)

cnt = 0
lth = coords.shape[0]
for counter in range(1, lth):
    if coords[cnt+1, 0] == coords[cnt, 0]:
        coords[cnt, 3:5] = coords[cnt+1, 1:3]
        coords = np.delete(coords, cnt+1, axis=0)
    if coords[cnt+1, 0] == coords[cnt, 0]:
        coords[cnt, 5:7] = coords[cnt+1, 1:3]
        coords = np.delete(coords, cnt+1, axis=0)
    if coords[cnt+1, 0] == coords[cnt, 0]:
        coords[cnt, 7:9] = coords[cnt+1, 1:3]
        coords = np.delete(coords, cnt+1, axis=0)
    if coords[cnt+1, 0] == coords[cnt, 0]:
        coords[cnt, 9:11] = coords[cnt+1, 1:3]
        coords = np.delete(coords, cnt+1, axis=0)
    cnt = cnt + 1
Can someone help me, either with an index or a better loop?
Thanks a ton
Assuming that
coords = pd.read_csv('IDQQCoords.csv')
means you are using Pandas, the easiest way to produce the desired result is DataFrame.pivot:
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'shapeid': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x': np.random.random(14),
                   'y': np.random.random(14)})
# shapeid x y
# 0 0 0.896705 0.603638
# 1 0 0.730239 0.588791
# 2 0 0.783276 0.069347
# 3 0 0.741652 0.942829
# 4 0 0.462090 0.372599
# 5 1 0.642565 0.451989
# 6 1 0.224864 0.450841
# 7 1 0.708547 0.033112
# 8 1 0.747126 0.169423
# 9 2 0.625107 0.180155
# 10 2 0.579956 0.352746
# 11 2 0.242640 0.342806
# 12 2 0.131956 0.277638
# 13 2 0.143948 0.375779
# number the points within each shapeid: 0, 1, 2, ...
df['col'] = df.groupby('shapeid').cumcount()
# one row per shapeid, one (x, y) column pair per point number
df = df.pivot(index='shapeid', columns='col')
df = df.sort_index(axis=1, level=1)
# flatten the MultiIndex columns to x0, y0, x1, y1, ...
df.columns = ['{}{}'.format(col, num) for col, num in df.columns]
print(df)
yields
x0 y0 x1 y1 x2 y2 x3 \
shapeid
0 0.896705 0.603638 0.730239 0.588791 0.783276 0.069347 0.741652
1 0.642565 0.451989 0.224864 0.450841 0.708547 0.033112 0.747126
2 0.625107 0.180155 0.579956 0.352746 0.242640 0.342806 0.131956
y3 x4 y4
shapeid
0 0.942829 0.462090 0.372599
1 0.169423 NaN NaN
2 0.277638 0.143948 0.375779
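As noted in the question, some groups have only 4 coordinate pairs, so those rows end up with NaN (see shapeid 1 above). If a placeholder value is acceptable, a short follow-up sketch (assuming 0 is a reasonable fill value) is:
# turn shapeid back into a regular column and fill missing coordinates with 0
df = df.fillna(0).reset_index()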