Create and fill new columns using values in rows pandas - python

I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each value in Col2 of Dataframe A (in this case, X and Y), filled with the values from the "Value" column after merging the two dataframes on Col1. Here it is:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!

B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If you are going to have hundreds of distinct values in Col2, you can run the same assignment in a loop:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)
B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
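If Col2 can hold many distinct values, a pivot-based sketch avoids relying on row order altogether (assuming Col1 values are unique within Dataframe B):
wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None  # drop the leftover 'Col2' axis label from the pivot
B = B.merge(wide, on='Col1', how='left')
This keeps each value aligned to its Col1 key via the merge rather than via reset_index positions.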

Related

Adding new rows with default value based on dataframe values into dataframe

I have data with a large number of columns:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
4 a z 1 0
...
98 a z 1 1
100 a x 1 0
I want to fill in the missing ID values with a default row that indicates the data is missing. For example, here the missing IDs are 3 and 99, and hypothetically let's say the missing rows should look like the row for ID 100:
ID col1 col2 col3 ... col100
3 a x 1 0
99 a x 1 0
Expected output:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
3 a x 1 0
4 a z 1 0
...
98 a z 1 1
99 a x 1 0
100 a x 1 0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy()  # assume that this will net us row 100
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index = True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list noresponserows with the intention of concatenating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only 1 row, when in my data I know there is more than one row that needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, and why I can't directly append to a dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason I take a copy of a row to get noresponse is that there are a large number of columns, so it is easier to reuse an existing row.
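As a side note on why the attempts above appear to do nothing: df.append and pd.concat both return a new DataFrame rather than modifying df in place (and DataFrame.append has since been removed in pandas 2.0), so the result has to be assigned back, roughly:
df = pd.concat([df, temp], ignore_index=True)
The answer below avoids the issue entirely by writing rows in place with df.loc.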
Say you have a dataframe like this:
>>> df
col1 col2 col100 ID
0 a x 0 1
1 a y 3 2
2 a z 1 4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1 a
col2 z
col100 1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
3 a z 1
Then use sort_index to sort the rows by the index:
>>> df = df.sort_index()
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
3 a z 1
4 a z 1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
ID col1 col2 col100
0 1 a x 0
1 2 a y 3
2 3 a z 1
3 4 a z 1
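To fill every gap rather than just ID 3, a minimal sketch building on the same idea (assuming maxID comes from the question and df is still indexed by ID, i.e. before the reset_index step):
for i in range(1, maxID + 1):
    if i not in df.index:        # only the missing IDs get the default row
        df.loc[i] = default_row
df = df.sort_index().reset_index()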

Pandas groupby concat ungrouped column into comma separated string

I have the following example df:
col1 col2 col3 doc_no
0 a x f 0
1 a x f 1
2 b x g 2
3 b y g 3
4 c x t 3
5 c y t 4
6 a x f 5
7 d x t 5
8 d x t 6
I want to group by the first 3 columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string per group, and also generate a sorted count column for the 3-column grouping (count). Example desired output below (column order doesn't matter):
col1 col2 col3 count doc_no
0 a x f 3 0, 1, 5
1 d x t 2 5, 6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
How would I go about doing this? I used the below line to get just the grouping and the count:
grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
.sort_values(['count'], ascending=False).reset_index()
But I'm not sure how to also get the concatenated doc_no column in the same code line.
Try groupby and agg like so:
(df.groupby(['col1', 'col2', 'col3'])['doc_no']
   .agg(['count', ('doc_no', lambda x: ','.join(map(str, x)))])
   .sort_values('count', ascending=False)
   .reset_index())
col1 col2 col3 count doc_no
0 a x f 3 0,1,5
1 d x t 2 5,6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
agg is simple to use because you can specify a list of reducers to run on a single column.
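A related sketch, assuming pandas 0.25 or later, uses named aggregation so the output columns are labelled explicitly:
out = (df.groupby(['col1', 'col2', 'col3'])
         .agg(count=('doc_no', 'size'),
              doc_no=('doc_no', lambda x: ', '.join(map(str, x))))
         .sort_values('count', ascending=False)
         .reset_index())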
Let us do
df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
col1 col2 col3 count join
0 a x f 3 0,1,5
1 b x g 1 2
2 b y g 1 3
3 c x t 1 3
4 c y t 1 4
5 d x t 2 5,6
Another way
df2 = df.groupby(['col1','col2','col3']).agg(doc_no=('doc_no', list)).reset_index()
df2['doc_no'] = df2['doc_no'].astype(str).str[1:-1]

Aggregate over difference of levels of factor in Pandas DataFrame?

Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2, where for each value in col C of df1 a new column D holds the difference between the df1 values in col B where col A == a and where col A == b:
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns = ['A'],values = 'B', index = 'C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
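For completeness, a sketch of the full pivot route that also flattens the result into the two-column shape shown in the question:
pivoted = df1.pivot_table(index='C', columns='A', values='B')
df2 = (pivoted['a'] - pivoted['b']).rename('D').reset_index()
Because pivot_table aligns on C, the order of the a and b rows within each C group does not matter.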

Pandas explode index

I have a df like below
a = pd.DataFrame([{'col1': ['a', 'b', 'c'], 'col2': 'x'}, {'col1': ['d', 'b'], 'col2': 'y'}])
When I do an explode using a.explode('col1'), I get the results below:
col1 col2
a x
b x
c x
d y
b y
However, I wanted something like below,
col1 col2 col1_index
a x 1
b x 2
c x 3
d y 1
b y 2
Can someone help me?
You could do the following:
result = a.explode('col1').reset_index().rename(columns={'index' : 'col1_index'})
result['col1_index'] = result.groupby('col1_index').cumcount()
print(result)
Output
col1_index col1 col2
0 0 a x
1 1 b x
2 2 c x
3 0 d y
4 1 b y
After you explode you can simply do:
a['col1_index'] = a.groupby('col2').cumcount()+1
col1 col2 col1_index
0 a x 1
1 b x 2
2 c x 3
3 d y 1
4 b y 2
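Putting the two steps together, a minimal sketch that reproduces the 1-based numbering asked for (assuming a is the original two-row dataframe):
result = a.explode('col1').reset_index()
result['col1_index'] = result.groupby('index').cumcount() + 1
result = result.drop(columns='index')
Grouping on the original index counts within each pre-explode row, so this still works even if col2 values repeat across rows.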

Merge pandas DataFrame on column of float values

I have two data frames that I am trying to merge.
Dataframe A:
col1 col2 sub grade
0 1 34.32 x a
1 1 34.32 x b
2 1 34.33 y c
3 2 10.14 z b
4 3 33.01 z a
Dataframe B:
col1 col2 group ID
0 1 34.32 t z
1 1 54.32 s w
2 1 34.33 r z
3 2 10.14 q z
4 3 33.01 q e
I want to merge on col1 and col2. I've been using pd.merge with the following syntax:
pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
I would like the result to be:
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b t z
2 1 54.32 NaN NaN s w
3 1 34.33 y c r z
4 2 10.14 z b q z
5 3 33.01 z a q e
You can use a little hack: multiply the float column by some constant like 100, 1000, ..., convert it to int, merge, and finally divide by the constant:
N = 100
#thank you koalo for comment
A.col2 = np.round(A.col2*N).astype(int)
B.col2 = np.round(B.col2*N).astype(int)
df = pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
df.col2 = df.col2 / N
print (df)
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b t z
2 1 34.33 y c r z
3 2 10.14 z b q z
4 3 33.01 z a q e
5 1 54.32 NaN NaN s w
I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.
In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.
from scipy import spatial

threshold = 0.99999
similarity = 1 - spatial.distance.cosine(row1, row2)
if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair
This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
Assuming that the column (col2) has n decimal places:
A.col2 = np.round(A.col2, decimals=n)
B.col2 = np.round(B.col2, decimals=n)
df = A.merge(B, left_on=['col1', 'col2'], right_on=['col1', 'col2'])
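As an alternative sketch closer to the np.isclose idea, pd.merge_asof can match each row to the nearest float value within an explicit tolerance (assuming a tolerance of 0.005 is acceptable; both frames must be sorted on col2):
A = A.sort_values('col2')
B = B.sort_values('col2')
df = pd.merge_asof(A, B, on='col2', by='col1', direction='nearest', tolerance=0.005)
Note that merge_asof behaves like a left join on A, so rows that exist only in B (such as 54.32) will not appear in the result.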
