I have two data frames.
df1:
ID Value
1 A
2 B
3 C
df2:
ID F_ID S_ID
1 2 3
2 3 1
3 1 2
I want to create a column next to each ID column that will store the values looked up from df1. The output should look like this:
ID ID_Value F_ID F_ID_Value S_ID S_ID_Value
1 A 2 B 3 C
2 B 3 C 1 A
3 C 1 A 2 B
Basically looking up from df1 and creating a new column to store these values.
You can use map on each column of df2 with a Series built from df1:
s = df1.set_index('ID')['Value']
for col in df2.columns:
    df2[f'{col}_value'] = df2[col].map(s)
print (df2)
ID F_ID S_ID ID_value F_ID_value S_ID_value
0 1 2 3 A B C
1 2 3 1 B C A
2 3 1 2 C A B
Or with apply and concat (note add_suffix, not add_prefix, so the new columns end in _value):
df_ = pd.concat([df2, df2.apply(lambda x: x.map(s)).add_suffix('_value')], axis=1)
df_ = df_.reindex(sorted(df_.columns), axis=1)
If order is important (I realised from the comments it is not), it is necessary to use DataFrame.insert with enumerate and some arithmetic:
s = df1.set_index('ID')['Value']
for i, col in enumerate(df2.columns, 1):
    df2.insert(i * 2 - 1, f'{col}_value', df2[col].map(s))
print (df2)
ID ID_value F_ID F_ID_value S_ID S_ID_value
0 1 A 2 B 3 C
1 2 B 3 C 1 A
2 3 C 1 A 2 B
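For completeness, a sketch of the same interleaved layout built without mutating df2 in place (the sample frames are reconstructed from the question):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'F_ID': [2, 3, 1], 'S_ID': [3, 1, 2]})

s = df1.set_index('ID')['Value']

# pair each original column with its mapped twin, then concatenate the pairs
pairs = [pd.concat([df2[col], df2[col].map(s).rename(f'{col}_value')], axis=1)
         for col in df2.columns]
print(pd.concat(pairs, axis=1))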
Related
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True), and the result is like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck; how do I get the new columns formatted like this?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
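If the spacing around the commas is not guaranteed (an assumption beyond the sample data), the same approach works with an explicit strip:

import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# split on the bare comma and strip stray whitespace from every piece
s = df['Col1'].str.split(',').explode().str.strip()
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
print(s.unstack().add_prefix('Col '))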
There are most probably more general approaches you can use, but this worked for me. Please note that it is based on a lot of assumptions and constraints of your particular example.
import pandas as pd

test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next, (first assumption) we need to ensure that we can later use ' ' as a delimiter to extract the columns. To do that, we need to remove all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need to get a list of unique columns. To do that, we first extract the column names from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done we can assign column values for each of the rows and use pd.concat to merge them back in to one DataFrame:
result_list = []
result_list.append(result_df)  # adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that will be used to represent each row of the source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is a (row_index, Series) tuple; we only need the Series
            if value != 'None':
                if value.split(' ')[1] == column:  # checking for the correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # adding one single-row DataFrame per source row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
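For comparison, the same row-by-row idea can be sketched more compactly by building one dict per row and letting the DataFrame constructor align the keys into columns (same assumptions about the 'number letter' format):

import pandas as pd

df = pd.DataFrame({'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# one dict per row, keyed by the letter part of each 'number letter' pair
rows = [{item.split(' ')[1]: item for item in cell.split(', ')}
        for cell in df['col_1']]
print(pd.DataFrame(rows))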
I have a data set as such:
import pandas as pd

x = {'column1': ['a','a','b','b','b','c','c','c','d'],
     'column2': [1,0,1,1,0,1,1,0,1]}
df = pd.DataFrame(x, columns=['column1', 'column2'])
print (df)
How would I extract only the rows where column2 has a value of 1 (like this):
x = {'column1': ['a','b','b','c','c','d'],
     'column2': [1,1,1,1,1,1]}
df = pd.DataFrame(x, columns=['column1', 'column2'])
print (df)
Also, how would I count the number of 1's for each value in column1, make a new column, and insert that count at the respective indexes of column1 (for example, how many 1's does value a in column1 have?), so that the dataframe turns into this format:
x = {'column1': ['a','b','b','c','c','d'],
     'column2': [1,1,1,1,1,1],
     'column3': [1,2,2,2,2,1]}
df = pd.DataFrame(x, columns=['column1', 'column2', 'column3'])
print (df)
First question:
df[df.column2==1].reset_index(drop=True)
will give you
column1 column2
0 a 1
1 b 1
2 b 1
3 c 1
4 c 1
5 d 1
Second question (applied to the filtered frame from the first step):
df['column3'] = df.groupby('column1').transform(len)
will give you
column1 column2 column3
0 a 1 1
1 b 1 2
2 b 1 2
3 c 1 2
4 c 1 2
5 d 1 1
Use boolean indexing with Series.eq (the method form of ==), and then Series.map with Series.value_counts:
df = df[df['column2'].eq(1)].copy()  # copy avoids SettingWithCopyWarning on the next assignment
df['column3'] = df['column1'].map(df['column1'].value_counts())
Alternative with GroupBy.transform and GroupBy.size:
df['column3'] = df.groupby('column1')['column1'].transform('size')
print (df)
column1 column2 column3
0 a 1 1
2 b 1 2
3 b 1 2
5 c 1 2
6 c 1 2
8 d 1 1
Last, for a default index, use DataFrame.reset_index with drop=True:
df = df.reset_index(drop=True)
print (df)
column1 column2 column3
0 a 1 1
1 b 1 2
2 b 1 2
3 c 1 2
4 c 1 2
5 d 1 1
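Both steps also chain nicely into one expression; a sketch using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'column1': ['a','a','b','b','b','c','c','c','d'],
                   'column2': [1,0,1,1,0,1,1,0,1]})

# filter, count per group, and rebuild the index in one chain
out = (df[df['column2'].eq(1)]
       .assign(column3=lambda d: d.groupby('column1')['column1'].transform('size'))
       .reset_index(drop=True))
print(out)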
How do I pivot a dataframe into a square dataframe where the values are the number of intersections in the value column?
My input dataframe is:
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
a b c
a 2 1 1
b 1 2 0
c 1 0 2
The values in the output data frame should be the number of intersecting values in the value column.
Use a self merge on value with crosstab:
merged = df.merge(df, on='value')
df = pd.crosstab(merged['field_x'], merged['field_y'])
print (df)
field_y a b c
field_x
a 2 1 1
b 1 2 0
c 1 0 2
Then remove the index and column names with rename_axis:
#pandas 0.24+
df = pd.crosstab(merged['field_x'], merged['field_y']).rename_axis(index=None, columns=None)
print (df)
a b c
a 2 1 1
b 1 2 0
c 1 0 2
#pandas below 0.24
df = pd.crosstab(merged['field_x'], merged['field_y']).rename_axis(None).rename_axis(None, axis=1)
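An equivalent sketch without the self merge: build the field-by-value incidence matrix once and multiply it by its own transpose, which counts the shared values for every pair of fields:

import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 1, 2, 5]})

# rows are fields, columns are values, entries are occurrence counts
ct = pd.crosstab(df['field'], df['value'])

# ct.dot(ct.T) sums the products of counts over the shared values
print(ct.dot(ct.T).rename_axis(None).rename_axis(None, axis=1))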
I am trying to add an incremental value to a column based on specific values of another column in a dataframe. So that...
col A col B
A 0
B 1
C 2
A 3
A 4
B 5
Would become something like this:
col A col B
A 1
B 2
C 3
A 1
A 1
B 2
I have tried using the groupby function but can't really get my head around setting incremental values on col B.
Any thoughts?
Thanks
I think you need factorize:
df['col B'] = pd.factorize(df['col A'])[0] + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
Another solution:
df['col B'] = pd.Categorical(df['col A']).codes + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
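A third option along the same lines, sketched with GroupBy.ngroup (sort=False keeps first-appearance order, matching factorize):

import pandas as pd

df = pd.DataFrame({'col A': ['A', 'B', 'C', 'A', 'A', 'B']})

# ngroup numbers each group 0, 1, 2, ... in order of first appearance
df['col B'] = df.groupby('col A', sort=False).ngroup() + 1
print(df)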
We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to one large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index to df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get the unique values across both data sets, then create a Series or a dictionary. This is your factorization, and it can be used across both data sets. Use map to get the output you are looking for.
import numpy as np
import pandas as pd

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
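Since the question's PS says the two frames arrive at different times, the factorization from df1 can also be stored on its own and reapplied later; a sketch reusing the uniques returned by pd.factorize as fixed categories:

import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'col1': ['A', 'B', 'D', 'E']})

# factorize df1 once and keep the uniques; they define the code per level
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# later, when df2 becomes available, reuse the stored uniques
# (levels unseen in df1 would get code -1)
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes
print(df2)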