I need to apply a list to a pandas DataFrame column by column. The operation to be performed is string concatenation. To be more specific:
Inputs I have:
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], columns=['Col1', 'Col2', 'Col3'])
lt = ['Prod1', 'Prod2', 'Prod3']
which results in:
>>>df
Col1 Col2 Col3
0 a b c
1 d e f
>>>lt
['Prod1', 'Prod2', 'Prod3']
Moreover, the length of lt will always be equal to the number of columns of df.
What I would like to have is a dataframe of this sort:
res = pd.DataFrame([['Prod1a', 'Prod2b', 'Prod3c'], ['Prod1d', 'Prod2e', 'Prod3f']],
columns=['Col1', 'Col2', 'Col3'])
which gives:
>>>res
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
So far, I've been able to solve the problem by looping through rows and columns, but I can't shake the idea that there's a more elegant way to solve it (maybe something like apply).
Does anyone have suggestions? Thanks!
You can perform broadcasted string concatenation:
lt + df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
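For reference, a minimal self-contained sketch of this approach (nothing here beyond the question's own data):
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], columns=['Col1', 'Col2', 'Col3'])
lt = ['Prod1', 'Prod2', 'Prod3']

# the list is broadcast across the columns, so each prefix is prepended
# to the string values of the column in the same position
res = lt + df
print(res)
Putting the list on the left-hand side is what makes the prefixes come first in the concatenation.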
You can also use NumPy's np.char.add function (it needs string input, hence the astype(str)):
import numpy as np

df[:] = np.char.add(lt, df.values.astype(str))
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Thirdly, there is the list comprehension option.
df[:] = [[i + v for i, v in zip(lt, V)] for V in df.values.tolist()]
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Related
I have the below df:
ID Col1 Col2 Col3
1 A NB C
2 A NB C
3 NS B NC
4 NS NB NC
5 NS B NC
6 NS B C
And I'm trying to get a count for each column based on its values:
How many "A" are in Col1
How many "B" are in Col2
How many "C" are in Col3
In the original df I have a lot of columns and conditions.
The expected output:
Col1 Col2 Col3
TotalCount"A" TotalCount"B" TotalCount"C"
So, I'm trying to get the list of columns and iterate over it, but I am not getting the expected results.
I'm working with pandas in a Jupyter notebook.
You can use df.eq here and pass a list of values to compare against.
values = ['A', 'B', 'C']
out = df.loc[:, 'Col1':].eq(values).sum()
Col1 2
Col2 3
Col3 3
dtype: int64
Extending @Ch3ster's answer to match the expected output:
In [1382]: values = ['A', 'B', 'C']
In [1391]: res = df.filter(like='Col', axis=1).eq(values).sum().to_frame().T
In [1392]: res
Out[1392]:
Col1 Col2 Col3
0 2 3 3
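For completeness, a runnable sketch of the whole thing with the question's frame typed out by hand (the data below is just transcribed from the question):
import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 2, 3, 4, 5, 6],
    'Col1': ['A', 'A', 'NS', 'NS', 'NS', 'NS'],
    'Col2': ['NB', 'NB', 'B', 'NB', 'B', 'B'],
    'Col3': ['C', 'C', 'NC', 'NC', 'NC', 'C'],
})
values = ['A', 'B', 'C']

# eq() compares each column against the value in the same position,
# sum() counts the matches per column, and to_frame().T turns the
# resulting Series into the single-row frame from the expected output
res = df.loc[:, 'Col1':].eq(values).sum().to_frame().T
print(res)  # Col1 2, Col2 3, Col3 3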
I have pandas dataframe like below:
df = pd.DataFrame ({'col1': ['apple;orange;pear', 'grape;apple;kiwi;pear'], 'col2': ['apple', 'grape;kiwi']})
col1 col2
0 apple;orange;pear apple
1 grape;apple;kiwi;pear grape;kiwi
I need the data like below:
col1 col2 col3
0 apple;orange;pear apple orange;pear
1 grape;apple;kiwi;pear grape;kiwi apple;pear
Does anyone know how to do that? Thanks.
In this example, for the second row, the sub-strings of col2 (grape;kiwi) appear at different positions within col1 (grape;apple;kiwi;pear).
"How do I create a new column in pandas from the difference of two string columns?" does not work in my case.
You can use set to find the differences. As a first step, split each string on ';' and convert the pieces to a set. (Sets are unordered, so the order of the items in col3 is not guaranteed.)
df['col3'] = df.apply(
    lambda x: ';'.join(set(x.col1.split(';')).difference(x.col2.split(';'))),
    axis=1
)
col1 col2 col3
0 apple;orange;pear apple orange;pear
1 grape;apple;kiwi;pear grape;kiwi apple;pear
Magic of str.get_dummies: get_dummies(';') builds one indicator column per fruit, subtracting col2's dummies leaves a 1 exactly where a fruit appears in col1 but not in col2, and the dot product with the column names glues those fruits back into a ';'-separated string.
s = df.col1.str.get_dummies(';').sub(df.col2.str.get_dummies(';'), fill_value=0)
df['col3'] = s.eq(1).dot(s.columns + ';').str[:-1]
df
col1 col2 col3
0 apple;orange;pear apple orange;pear
1 grape;apple;kiwi;pear grape;kiwi apple;pear
How do I get all the column names whose values are 'f' or 't' into an array?
df['FTI'].value_counts()
Instead of this single 'FTI' column, I need an array of the matching columns. Is that possible?
Reproducible example:
df = pd.DataFrame({'col1':[1,2,3], 'col2':['f', 'f', 'f'], 'col3': ['t','t','t'], 'col4':['d','d','d']})
col1 col2 col3 col4
0 1 f t d
1 2 f t d
2 3 f t d
Then, using eq and all:
>>> s = (df.eq('t') | df.eq('f')).all()
col1 False
col2 True
col3 True
col4 False
dtype: bool
To get the names:
>>> s[s].index.values
array(['col2', 'col3'], dtype=object)
To get the (1-based) positions:
>>> np.flatnonzero(s) + 1
array([2, 3])
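An equivalent way to pull the names, if you prefer indexing the columns directly with the boolean Series (purely a stylistic variant):
>>> df.columns[s]
Index(['col2', 'col3'], dtype='object')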
Yes, it is possible. Here is one way.
You can get the columns like this:
cols = []
for col in df.columns:
    # cast to str first so numeric columns don't raise on the .str accessor
    if df[col].astype(str).str.contains('f|t').any():
        cols.append(col)
Then you can just use this for frequencies
f = pd.Series(dtype=object)  # explicit dtype avoids the empty-Series dtype warning
for col in cols:
    f = pd.concat([f, df[col].value_counts()])
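With the sample frame above this ends up with cols == ['col2', 'col3'], and f holds the stacked counts (a quick sanity check, not part of the original answer):
print(cols)  # ['col2', 'col3']
print(f)     # f -> 3 (from col2), t -> 3 (from col3)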
I am wondering if there is a fast way to merge two pandas tables by regular expression in Python.
For example:
table A
col1 col2
1 apple_3dollars_5
2 apple_2dollar_4
1 orange_5dollar_3
1 apple_1dollar_3
table B
col1 col2
good (apple|oragne)_\dollars_5
bad .*_1dollar_.*
ok oragne_\ddollar_\d
Output:
col1 col2 col3
1 apple_3dollars_5 good
1 orange_5dollar_3 ok
1 apple_1dollar_3 bad
This is just an example. What I want is, instead of merging on one column that matches exactly, to join using a regular expression. Thank you!
First of all, fix the regexes in the B DataFrame:
In [222]: B
Out[222]:
col1 col2
0 good (apple|oragne)_\ddollars_5
1 bad .*_1dollar_.*
2 ok orange_\ddollar_\d
Now we can prepare the following variables:
In [223]: to_repl = B.col2.values.tolist()
In [224]: vals = B.col1.values.tolist()
In [225]: to_repl
Out[225]: ['(apple|oragne)_\\ddollars_5', '.*_1dollar_.*', 'orange_\\ddollar_\\d']
In [226]: vals
Out[226]: ['good', 'bad', 'ok']
Finally we can use them in the replace function:
In [227]: A['col3'] = A['col2'].replace(to_repl, vals, regex=True)
In [228]: A
Out[228]:
col1 col2 col3
0 1 apple_3dollars_5 good
1 2 apple_2dollar_4 apple_2dollar_4
2 1 orange_5dollar_3 ok
3 1 apple_1dollar_3 bad
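For anyone who wants to reproduce this, a rough sketch of the two frames as used above (the patterns are copied from the fixed B, including the harmless leftover 'oragne' spelling in the first one):
import pandas as pd

A = pd.DataFrame({'col1': [1, 2, 1, 1],
                  'col2': ['apple_3dollars_5', 'apple_2dollar_4',
                           'orange_5dollar_3', 'apple_1dollar_3']})
B = pd.DataFrame({'col1': ['good', 'bad', 'ok'],
                  'col2': [r'(apple|oragne)_\ddollars_5',
                           r'.*_1dollar_.*',
                           r'orange_\ddollar_\d']})

# rows of A.col2 that match no pattern (like apple_2dollar_4) keep their original value
A['col3'] = A['col2'].replace(B['col2'].tolist(), B['col1'].tolist(), regex=True)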
I took the idea from https://python.tutorialink.com/can-i-perform-a-left-join-merge-between-two-dataframes-using-regular-expressions-with-pandas/ and improved it a little, so that the original data can have more than one column and we can do a real left join (merge) with regex!
import pandas as pd

d = {'extra_colum1': ['x', 'y', 'z', 'w'],
     'field': ['ab', 'a', 'cd', 'e'],
     'extra_colum2': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(data=d)

df_dict = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['destination'])
df_dict['field'] = '.*' + df_dict['destination'] + '.*'
def merge_regex(df, df_dict, how, field):
    import re
    df_dict = df_dict.drop_duplicates()
    # all (pattern row, data row) index pairs where the regex matches the field value
    idx = [(i, j)
           for i, r in enumerate(df_dict[f'{field}'])
           for j, v in enumerate(df[f'{field}'])
           if re.match(r, v)]
    df_dict_idx, df_idx = zip(*idx)
    t = df_dict.iloc[list(df_dict_idx), 0].reset_index(drop=True)
    t1 = df.iloc[list(df_idx), df.columns.get_loc(f'{field}')].reset_index(drop=True)
    df_dict_translated = pd.concat([t, t1], axis=1)
    data = pd.merge(
        df,
        df_dict_translated,
        how=f'{how}',
        left_on=f'{field}',
        right_on=f'{field}'
    )
    data = data.drop_duplicates()
    return data
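A usage sketch with the small frames built above ('left' and 'field' are just the how and join column for this example):
result = merge_regex(df, df_dict, 'left', 'field')
print(result)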
Similar to @MaxU, I use .replace, but I replace the column of values that you want to merge on with the regex strings that they match. A small warning: this can cause issues such as a non-unique index if your text matches more than one regex pattern. So, using your dataframe A and @MaxU's fixed regexes for dataframe B:
patterns = B.col2.tolist()
# use the patterns themselves as the replacement values; double the backslashes so
# escapes like \d survive being used as a re.sub replacement string
A['joinCol'] = A.col2.replace(patterns, [p.replace('\\', '\\\\') for p in patterns], regex=True)
B = B.rename(columns={'col2': 'joinCol'})  # the join columns should have the same name
C = A.join(B.set_index('joinCol'), on='joinCol', rsuffix='_B')  # rsuffix avoids the col1 name clash
If you want, you can then drop that join column:
C = C.drop('joinCol', axis=1)
Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
I'm just expecting this as output. I don't need col4 and col5 in the output, and I also don't need any sum, count, mean, etc. I tried using pandas to achieve this but had no luck.
My code:
input_df = pd.read_csv("input.csv")
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code is returning 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need a dataframe like the one above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
First, you can use .drop() to delete col4 and col5, since you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then, you can use .drop_duplicates() to delete the duplicate rows in col1, col2 and col3.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
You'll notice that in the output the index is 0, 2 instead of 0, 1. To fix that you can do this:
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
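As a side note, reset_index does the same renumbering and is the more common idiom (purely a style preference):
df = df.reset_index(drop=True)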