Merging two pandas data frame with common columns - python

I have a lower triangular matrix and then I transpose it and I have the transpose of it.
I am trying to merge them together
lower triangular:
Data :
0 1 2 3
0 1 0 0 0
1 0.21 0 0 0
2 0.31 0.32 0 0
3 0.41 0.42 0.43 0
4 0.51 0.52 0.53 0.54
transpose triangular:
Data :
0 1 2 3
0 1 0.21 0.31 0.41
1 0 0 0.32 0.52
2 0 0 0 0.53
3 0 0 0 0.54
4 0 0 0 0
Merged matrix:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0.21 0 0.32 0.42 0.52
2 0.31 0.32 0 0.43 0.53
3 0.41 0.42 0.43 0 0.54
4 0.51 0.52 0.53 0.54 0
I tried using pd.merge but I couldn't get it to work

Let us using combine_first after mask
df.mask(df==0).T.combine_first(df).fillna(0)
Out[1202]:
0 1 2 3 4
0 1.00 0.21 0.31 0.41 0.51
1 0.21 0.00 0.32 0.42 0.52
2 0.31 0.32 0.00 0.43 0.53
3 0.41 0.42 0.43 0.00 0.54
4 0.51 0.52 0.53 0.54 0.00

How about just adding the two dataframes?
df3 = df1.add(df2, fill_value=0)
BR

Related

Find number of datapoints in each range

I have a data frame that looks like this
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Col1 is a primary key (each row has a unique value)
The max of col2 is 1 and the min is 0. I want to find the number of datapoint in ranges 0-.30 (both 0 and 0.30 are included), 0-.29, 0-.28, and so on till 0-.01. I can use pd.cut, but the lower limit is not fixed. My lower limit is always 0.
Can someone help?
One option using numpy broadcasting:
step = 0.01
up = np.arange(0, 0.3+step, step)
out = pd.Series((df['col2'].to_numpy()[:,None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3+step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))
.value_counts(sort=False).cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64

How to combine Date object with float Vector

I am reading two data frames from two separate csvs and trying to combine them in a single data frame.Both df1 & df2 should be combined row by row.df1 contains floating numbers and
df2 is a date.
df1=pd.read_csv("Weights.csv")
print(df1.head(5))
df2=pd.read_csv("Date.csv")
print(df2.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.06 0.06 -0.0 -0.0 0.11 0.06 0.37 0.01 0.05 0.10 -0.00 0.01 0.0
1 0.09 0.05 -0.0 -0.0 0.12 0.05 0.36 0.00 0.05 0.08 0.00 0.00 -0.0
2 0.14 0.07 -0.0 0.0 0.13 0.04 0.33 0.01 0.04 0.05 0.00 0.00 0.0
3 0.13 0.07 0.0 -0.0 0.12 0.06 0.34 0.01 0.05 0.04 0.01 0.00 -0.0
4 0.11 0.08 0.0 0.0 0.08 0.10 0.35 0.05 0.05 0.06 0.02 0.00 0.0
0
0 2010-12-29
1 2011-01-05
2 2011-01-12
3 2011-01-19
4 2011-01-26
I am facing problem using pd.concat in pandas.

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
Here is my solution:
data.unclassified = data[['mental_illness', 'feeling', 'flavor']].apply(lambda x: x.le(0.5)).apply(lambda x: 1 if sum(x) == 3 else 0, axis = 1)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14

How can I delete a row if a certain amount of values are missing? [duplicate]

This question already has answers here:
Python - Drop row if two columns are NaN
(5 answers)
How to drop column according to NAN percentage for dataframe?
(4 answers)
Closed 4 years ago.
I have the following dataset:
A B C D
0 0.46 0.23 NaN 0.41
1 0.65 0.48 0.07 0.15
2 NaN 1.00 0.79 0.09
3 0.50 0.97 0.07 0.55
4 0.45 0.44 0.23 0.41
5 NaN 0.39 NaN 0.31
6 0.32 0.87 0.73 0.57
7 0.82 0.67 0.73 0.19
8 0.65 NaN NaN 0.81
9 0.36 0.23 1.00 0.51
I would like to delete rows in which there are 2 or more values missing (or specify as more than 50% missing?), therefore I would like to delete rows 5 and 8 and get the following output:
A B C D
0 0.46 0.23 NaN 0.41
1 0.65 0.48 0.07 0.15
2 NaN 1.00 0.79 0.09
3 0.50 0.97 0.07 0.55
4 0.45 0.44 0.23 0.41
6 0.32 0.87 0.73 0.57
7 0.82 0.67 0.73 0.19
9 0.36 0.23 1.00 0.51
Thank you.

Pandas stack column pairs

I have a pandas dataframe with about 100 columns of following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
w1here (X,Y) are basically pairs of values
I need to create the following from above.
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns which are odd numbered and then stack all the Y columns which are even numbered.
I have no clue where to even start. For small number of columns I could easily have use the column names.
You can use lreshape, for column names use list comprehension:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Solution with MultiIndex and stack:
df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
You can use the documented pd.wide_to_long but you will need to use a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
stubnames=['X', 'Y'],
i='index',
j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00

Categories