Generating new columns as the full pairwise combination of other columns - Python

Could not find similar cases here.
Suppose I have a DataFrame:
df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})
So it is:
   A  B  C  I  II
0  2  2  3  1   0
1  2  2  3  0   1
2  1  3  3  0   0
3  2  3  4  1   1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so I get {I-A, I-B, I-C, II-A, II-B, II-C}.
Each new column is just an elementwise multiplication of the corresponding base columns:
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
At the moment I don't have a working solution. I'm trying to use loops (not succeeding so far), but I hope there's a more efficient way.

It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
...     for a in ["A", "B", "C"]:
...         new_df[i + "-" + a] = df[i] * df[a]
...
>>> new_df
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
Of course you could obtain the lists of column names as slices off df.columns, or in whatever other way is convenient. E.g. for your example dataframe you could write
>>> for i in df.columns[3:]:
...     for a in df.columns[:3]:
...         new_df[i + "-" + a] = df[i] * df[a]

Using loops, you can use the code below. It's definitely not the most elegant solution, but it should work for your purpose. It only requires that you specify the columns to use for the pairwise multiplication, and it stays quite readable, which you may want.
def element_wise_mult(first, second):
    result = []
    for i, el in enumerate(first):
        result.append(el * second[i])
    return result

if __name__ == '__main__':
    import pandas as pd

    df = pd.DataFrame({'A': [2,2,1,2],
                       'B': [2,2,3,3],
                       'C': [3,3,3,4],
                       'I': [1,0,0,1],
                       'II': [0,1,0,1]})
    fs = ['I', 'II']
    sc = ['A', 'B', 'C']
    series = []
    names = []
    for i in fs:
        for j in sc:
            names.append(i + '-' + j)
            # store each elementwise product as a pandas Series
            series.append(pd.Series(element_wise_mult(df[i], df[j])))
    # reconstruct the dataframe from the stored series and names
    print(pd.DataFrame(series, index=names).T)
Returns:
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4

Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})

# tile [A, B, C] twice and repeat each of [I, II] three times, then multiply elementwise
cross_vals = np.tile(df[df.columns[:3]].values, (1, 2)) * np.repeat(df[df.columns[3:]].values, 3, axis=1)
cross_cols = np.repeat(df.columns[3:].values, 3) + np.array('-') + np.tile(df.columns[:3].values, (1, 2))
new_df = pd.DataFrame(cross_vals, columns=cross_cols[0])
Then new_df is:
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
You could generalize it to any size as long as the columns A,B,C,... are consecutive and similarly the columns I,II,... are consecutive.
For the general case, if the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})

let = np.array(['A', 'B', 'C'], dtype=object)
num = np.array(['I', 'II'], dtype=object)
cross_vals = np.tile(df[let].values, (1, len(num))) * np.repeat(df[num].values, len(let), axis=1)
cross_cols = np.repeat(num, len(let)) + np.array('-') + np.tile(let, (1, len(num)))
new_df = pd.DataFrame(cross_vals, columns=cross_cols[0])
And the result is the same as above.

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({'sent.1': [0,1,0,1],
                   'sent.2': [0,1,1,0],
                   'sent.3': [0,0,0,1],
                   'sent.4': [1,1,0,1]})
I am trying to replace the non-zero values with the character at index 5 of the column names (which is the numeric part of the column names), so the output should be:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace it with the column names themselves, the code works, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
Output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
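As to why the original replace attempt fails: when the value argument is dict-like (a Series counts), DataFrame.replace looks up each column's replacement by column name, so the Series must be indexed by the actual column names rather than by the digit characters. A sketch of the corrected call (it yields strings, so cast afterwards if you need integers):
# index the replacement values by the real column names
repl = pd.Series([c[5] for c in df.columns], index=df.columns)
df.replace(1, repl).astype(int)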

Python alternative to R mutate

I want to convert R code into Python. The code in R is
df %>% mutate(N = if_else(Interval != lead(Interval) | row_number() == n(), criteria/Count, NA_real_))
In Python I wrote the following:
import pandas as pd
import numpy as np

df = pd.read_table('Fd.csv', sep=',')
for i in range(1, len(df.Interval) - 1):
    x = df.Interval[i]
    n = df.Interval[i+1]
    if x != n | x == df.Interval.tail().all():
        df['new'] = (df.criteria / df.Count)
    else:
        df['new'] = 'NaN'
df.to_csv(r'dataframe.csv', index=False, header=True)
However, the output returns all NaNs.
Here is what the data looks like
Interval | Count | criteria
0        | 0     | 0
0        | 1     | 0
0        | 2     | 0
0        | 3     | 0
1        | 4     | 1
1        | 5     | 2
1        | 6     | 3
1        | 7     | 4
2        | 8     | 1
2        | 9     | 2
3        | 10    | 3
and this is what I want to get (I also need to consider the last line):
Interval | Count | criteria | new
0        | 0     | 0        |
0        | 1     | 0        |
0        | 2     | 0        |
0        | 3     | 0        | 0
1        | 4     | 1        |
1        | 5     | 2        |
1        | 6     | 3        |
1        | 7     | 4        | 0.5714
2        | 8     | 1        |
2        | 9     | 2        | 0.2222
3        | 10    | 3        | 0.3333
If anyone could help me find my mistake, I would greatly appreciate it.
1. Start indexing at 0
The first thing to note is that Python starts indexing at 0 (in contrast to R which starts at 1). Therefore, you need to modify the index range of your for-loop.
2. Specify row indices
When calling
df['new']=(df.criteria/df.Count)
or
df['new']='NaN'
you are setting/getting all the values in the "new" column. However, you intend to set the value only in some rows. Therefore, you need to specify the row.
3. Working example
import pandas as pd

df = pd.DataFrame()
df["Interval"] = [0,0,0,0,1,1,1,1,2,2,3]
df["Count"] = [0,1,2,3,4,5,6,7,8,9,10]
df["criteria"] = [0,0,0,0,1,2,3,4,1,2,3]
df["new"] = ["NaN"] * len(df.Interval)

last_row = len(df.Interval) - 1
for row in range(0, len(df.Interval)):
    current_value = df.Interval[row]
    next_value = df.Interval[min(row + 1, last_row)]
    if (current_value != next_value) or (row == last_row):
        result = df.loc[row, 'criteria'] / df.loc[row, 'Count']
        df.loc[row, 'new'] = result
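For completeness, a vectorized sketch that mirrors the R mutate/lead logic directly (a hedged alternative, not part of the loop above): shift(-1) plays the role of lead, and since the shift leaves NaN on the last row, the inequality is also True there, which covers the row_number() == n() case.
import numpy as np

# Interval != lead(Interval); the last row is included automatically
boundary = df["Interval"].ne(df["Interval"].shift(-1))
df["new"] = np.where(boundary, df["criteria"] / df["Count"], np.nan)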

"Zipping" two dataframes by column values

Suppose I have a dataframe like:
df1 = pd.DataFrame({"A": range(6), "key": [0,1]*3})
df1
   A  key
0  0    0
1  1    1
2  2    0
3  3    1
4  4    0
5  5    1
and
df2 = pd.DataFrame({"C": ["k0-"+str(x) for x in range(3)] + ["k1-"+str(x) for x in range(3)], "key": [0]*3 + [1]*3})
df2
      C  key
0  k0-0    0
1  k0-1    0
2  k0-2    0
3  k1-0    1
4  k1-1    1
5  k1-2    1
Values in C are all unique and values in key have no such pattern in a real dataset.
I'm trying to merge the two with a resulting dataframe, where values in column C will be taken exactly once for a matching value in column key.
I.e.
   A  key     C
0  0    0  k0-0
1  1    1  k1-0
2  2    0  k0-1
3  3    1  k1-1
4  4    0  k0-2
5  5    1  k1-2
The order doesn't matter, i.e. values in C do not need to be taken sequentially. This is a toy example, I have ~10 keys in reality.
I know I can probably do an outer join and then somehow drop the non-unique C values. But this could be overkill, as there are many rows in the real datasets (~30k).
Thanks in advance!
You can add an extra column to be used in the join:
df1['order'] = df1.groupby('key').cumcount()
df2['order'] = df2.groupby('key').cumcount()
# If you want to match on random order:
# df2['order'] = df2.sample(frac=1).groupby('key').cumcount()
df1.merge(df2, on=['key', 'order'])
Result:
   A  key  order     C
0  0    0      0  k0-0
1  1    1      0  k1-0
2  2    0      1  k0-1
3  3    1      1  k1-1
4  4    0      2  k0-2
5  5    1      2  k1-2
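If you don't want the helper column in the final output, you can drop it after the merge; a small sketch using the frames above:
result = df1.merge(df2, on=['key', 'order']).drop(columns='order')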
You can build a dictionary of iterators and call next on the appropriate iterator depending on the 'key'.
g = {k: iter(v) for k, v in df2.groupby('key').C}
df1.assign(C=[next(g[x]) for x in df1.key])
   A  key     C
0  0    0  k0-0
1  1    1  k1-0
2  2    0  k0-1
3  3    1  k1-1
4  4    0  k0-2
5  5    1  k1-2

Counting the number of occurrences in runs of 2 or more

I am a beginner, and I really need help with the following:
I need to do something similar to the following, but on a two-dimensional dataframe: Identifying consecutive occurrences of a value
I need to use that answer, but for a two-dimensional dataframe, counting at least 2 consecutive ones along the columns dimension. Here is a sample dataframe:
my_df =
   0  1  2
0  1  0  1
1  0  1  0
2  1  1  1
3  0  0  1
4  0  1  0
5  1  1  0
6  1  1  1
7  1  0  1
The output I am looking for is:
   0  1  2
0  3  5  4
Instead of the column 'consecutive', I need a new output called out_1_df for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2
out_2_df= (out_1_df > threshold).astype(int)
I tried the following:
out_1_df = my_df.groupby((my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd

df = pd.DataFrame({0: [1,0,1,0,0,1,1,1],
                   1: [0,1,1,0,1,1,1,0],
                   2: [1,0,1,1,0,0,1,1]})
# a 1 is counted if the neighbor above or below equals it,
# i.e. it belongs to a run of at least 2
out_2_df = ((df.diff(axis=0).eq(0) | df.diff(periods=-1, axis=0).eq(0)) & df.eq(1)).sum(axis=0)
>>> out_2_df
0    3
1    5
2    4
dtype: int64
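Alternatively, here is a sketch that applies the run-length idea from the linked answer column by column (count_in_runs is a hypothetical helper name, not from the original answers):
def count_in_runs(s, threshold=2):
    # length of the run of identical values each element belongs to
    runs = s.groupby((s != s.shift()).cumsum()).transform('size')
    # count the 1s sitting in runs of at least `threshold`
    return ((runs >= threshold) & s.eq(1)).sum()

out_2_df = df.apply(count_in_runs)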

pandas DataFrame: create, access, append a MultiIndex with different column types - SQL table style

I thought that with the huge pandas.DataFrame library it should be pretty straightforward to do all the standard stuff you can do with an SQL table... but after looking into many options I still haven't found a good working solution.
Requirements:
a table with 4 columns of different data types (uint32, string, ...), 3 of them should work as index
many (>10k) additional columns of type int8
Initially I had the idea to add rows and columns dynamically, but that turned out to be very slow (using df.at[row, col] = y).
I ended up creating a DataFrame with a few columns of different types and joining it with another large DataFrame created from a numpy array with elements of type uint8.
... that looked quite good, but now nothing works to access, add or set array elements using the index.
import numpy as np
import pandas as pd

# create DataFrame with the index and value columns
idx_names = ['A', 'B', 'C']
col_names = ['y']
df = pd.DataFrame(columns=idx_names + col_names)

# create DataFrame from numpy array
npa = np.zeros((5, 10), dtype=np.uint8)
dfa = pd.DataFrame(npa)

# add DataFrames column-wise
t = pd.concat([df, dfa], axis=1)

# set index columns
t.set_index(idx_names, inplace=True)
              y  0  1  2  3  4  5  6  7  8  9
A   B   C
NaN NaN NaN NaN  0  0  0  0  0  0  0  0  0  0
NaN NaN NaN NaN  0  0  0  0  0  0  0  0  0  0
NaN NaN NaN NaN  0  0  0  0  0  0  0  0  0  0
NaN NaN NaN NaN  0  0  0  0  0  0  0  0  0  0
NaN NaN NaN NaN  0  0  0  0  0  0  0  0  0  0
Now I would like to set values in the columns (y, 0, ..., 9) by providing an index.
If the index is not already available, it should be added to the table.
t( (t['A']='US',t['B']='CA',t['C']='SFO') , 'y') = "IT"
t( (t['A']='US',t['B']='CA',t['C']='LA' ) , '1') = 255
Assuming you have the following multi-index DataFrame:
In [44]: df
Out[44]:
       d
a b c
0 0 1  1
4 4 4  3
0 1 4  4
2 6 1  3
0 1 3  6
and you want to add the following 2D array as 10 new columns:
In [45]: data
Out[45]:
array([[ 0.76021523, 0.92020945, 0.20205685, 0.03888115, 0.41166093, 0.67509844, 0.15351393, 0.00926459, 0.09297956, 0.72930072],
[ 0.38229582, 0.88199428, 0.08153019, 0.08367272, 0.88548522, 0.50332168, 0.94652147, 0.83362442, 0.219431 , 0.09399454],
[ 0.43743926, 0.79447959, 0.18430898, 0.31534202, 0.63229928, 0.80921108, 0.76570853, 0.09890863, 0.33604303, 0.92960105],
[ 0.6561763 , 0.26731786, 0.1266551 , 0.78960943, 0.900017 , 0.02468355, 0.99110764, 0.40402032, 0.46224193, 0.44569296],
[ 0.1509643 , 0.26830514, 0.69337022, 0.1339183 , 0.42711838, 0.0883597 , 0.6923594 , 0.01451872, 0.56684861, 0.46792245]])
Solution:
In [47]: df = df.join(pd.DataFrame(data, index=df.index))
In [48]: df
Out[48]:
       d         0         1         2         3         4         5         6         7         8         9
a b c
0 0 1  1  0.760215  0.920209  0.202057  0.038881  0.411661  0.675098  0.153514  0.009265  0.092980  0.729301
4 4 4  3  0.382296  0.881994  0.081530  0.083673  0.885485  0.503322  0.946521  0.833624  0.219431  0.093995
0 1 4  4  0.437439  0.794480  0.184309  0.315342  0.632299  0.809211  0.765709  0.098909  0.336043  0.929601
2 6 1  3  0.656176  0.267318  0.126655  0.789609  0.900017  0.024684  0.991108  0.404020  0.462242  0.445693
0 1 3  6  0.150964  0.268305  0.693370  0.133918  0.427118  0.088360  0.692359  0.014519  0.566849  0.467922
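As for the t(...) = ... pseudo-syntax in the question: with a complete key tuple, recent pandas versions support setting-with-enlargement through .loc, so a hedged sketch might look like the following (the 'US'/'CA'/'SFO' labels are purely illustrative, and this behavior has varied across pandas versions):
# assumes t has the 3-level index (A, B, C) built above; a full key tuple
# sets the cell and appends a new row if the key does not exist yet
t.loc[('US', 'CA', 'SFO'), 'y'] = 'IT'
t.loc[('US', 'CA', 'LA'), 1] = 255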
