I have the following dataframe:
A B
Tenor
1 15.1726 0.138628
2 15.1726 0.147002
3 15.1726 0.155376
4 15.1726 0.163749
5 15.1726 0.172123
I want to be able to create another column with a string formed by concatenating the previous columns, including the index. For instance, the first row of this new column would be: XXXX1XXXX15.1726XXXX0.138628
How can I do that in Pandas? If I try to use df[ColumnName] in the string formula, Pandas will always bring the index along, which messes up my string.
You can use apply:
df['NewCol'] = df.apply(lambda x: "XXXX" + str(x.name) + "XXXX" + str(x.A) + "XXXX" + str(x.B), axis=1)
Also, a bit shorter, from @Abdou:
df['joined'] = df.apply(lambda x: 'XXXX' + 'XXXX'.join(map(str, x)), axis=1)
You can try something like this:
df['newColumn'] = [("XXXX"+str(id)+"XXXX" +str(entry.A) + "XXXX" +str(entry.B)) for id,entry in df.iterrows()]
I thought this was interesting:
df.reset_index().applymap('XXXX{}'.format).sum(axis=1)
0 XXXX1XXXX15.1726XXXX0.138628
1 XXXX2XXXX15.1726XXXX0.147002
2 XXXX3XXXX15.1726XXXX0.155376
3 XXXX4XXXX15.1726XXXX0.163749
4 XXXX5XXXX15.1726XXXX0.172123
dtype: object
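For larger frames, the same string can be built without apply by converting each piece to strings and concatenating whole columns at once; a minimal sketch, assuming the Tenor index and the A and B columns from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [15.1726] * 5,
     "B": [0.138628, 0.147002, 0.155376, 0.163749, 0.172123]},
    index=pd.Index([1, 2, 3, 4, 5], name="Tenor"),
)

# Turn the index into a Series aligned with the frame, then concatenate
# column-wise instead of row-by-row.
idx = df.index.to_series().astype(str)
df["NewCol"] = ("XXXX" + idx
                + "XXXX" + df["A"].astype(str)
                + "XXXX" + df["B"].astype(str))
print(df["NewCol"].iloc[0])  # XXXX1XXXX15.1726XXXX0.138628
```

Because to_series() keeps the original index, all the + operations align row by row without an explicit loop.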
I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446
002447-01
002447-02
002448
This is what I was able to put together thus far.
readie = pd.read_csv('title.csv')
i = 0
for row in readie:
    readie.append(row)
    i += 1
    if readie['column title'][i][0:5] != readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are considered as not matching:
df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True) # reset your index
first_6 = df['col'].str.slice(stop=6)
mask = first_6 != first_6.shift(fill_value=first_6[0])
df.index = df.index + mask.cumsum()
df = df.reindex(range(df.index[-1] + 1))
print(df)
col
0 002446
1 NaN
2 002447-01
3 002447-02
4 NaN
5 002448
6 NaN
7 00244
8 NaN
9 002448
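An alternative sketch of the same idea, grouping on the first six characters and inserting an empty string rather than NaN (the column name 'col' follows the sample above):

```python
import pandas as pd

df = pd.DataFrame({"col": ["002446", "002447-01", "002447-02", "002448"]})

# Group rows by their first six characters, append a blank row after each
# group, then drop the trailing blank so the frame doesn't end with one.
blank = pd.DataFrame({"col": [""]})
pieces = [pd.concat([grp, blank])
          for _, grp in df.groupby(df["col"].str[:6], sort=False)]
out = pd.concat(pieces, ignore_index=True).iloc[:-1]
print(out["col"].tolist())
# ['002446', '', '002447-01', '002447-02', '', '002448']
```

sort=False keeps the groups in their original order of appearance instead of sorting by key.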
I know I can do the following if we are checking only two columns together.
df['flag'] = df['a_id'].isin(df['b_id'])
where df is a data frame, and a_id and b_id are two columns of the data frame. It will return True or False based on the match. But I need to compare multiple columns together.
For example, if there are a_id, a_region, a_ip, b_id, b_region and b_ip columns, I want to compare like below:
a_key = df['a_id'] + df['a_region'] + df['a_ip']
b_key = df['b_id'] + df['b_region'] + df['b_ip']
df['flag'] = a_key.isin(b_key)
Somehow the above code always returns False. The output should be like below:
The first row's flag should be True because there is a match: a_key becomes 2a10, which matches the last row of b_key (2a10).
You were going in the right direction, just use:
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
a_key.isin(b_key)
Mine gives the results below:
0 True
1 False
2 False
You can use isin with DataFrame as value, but as per the docs:
If values is a DataFrame, then both the index and column labels must
match
So this should work:
# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])
# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)
# Get actual rows with boolean indexing
df_a.loc[matched]
# ... or add boolean flag to dataframe
df['flag'] = matched
Here's one approach using DataFrame.merge, pandas.concat and testing for duplicated values:
df_merged = df.merge(df,
                     left_on=['a_id', 'a_region', 'a_ip'],
                     right_on=['b_id', 'b_region', 'b_ip'],
                     suffixes=('', '_y'))
df['flag'] = pd.concat([df, df_merged[df.columns]]).duplicated(keep=False)[:len(df)].values
[out]
a_id a_region a_ip b_id b_region b_ip flag
0 2 a 10 3222222 sssss 22222 True
1 22222 bcccc 10000 43333 ddddd 11111 False
2 33333 acccc 120000 2 a 10 False
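A related option, not shown in the answers above, is merge with indicator=True, which adds a _merge column saying whether each a-side key found a b-side match; a sketch using the sample data from the output above:

```python
import pandas as pd

df = pd.DataFrame({
    "a_id": [2, 22222, 33333],
    "a_region": ["a", "bcccc", "acccc"],
    "a_ip": [10, 10000, 120000],
    "b_id": [3222222, 43333, 2],
    "b_region": ["sssss", "ddddd", "a"],
    "b_ip": [22222, 11111, 10],
})

left = df[["a_id", "a_region", "a_ip"]]
# drop_duplicates avoids row duplication if the same b-key appears twice
right = df[["b_id", "b_region", "b_ip"]].drop_duplicates()
right.columns = left.columns  # align names so merge compares the triples

# how="left" preserves the left row order, so the flag lines up with df
merged = left.merge(right, how="left", indicator=True)
df["flag"] = (merged["_merge"] == "both").to_numpy()
print(df["flag"].tolist())  # [True, False, False]
```

This compares the three columns as a tuple, so the dtype mismatch that made the string-concatenation version return all False never arises.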
I have a dataframe made from a CSV in which missing data is represented by the ? symbol. I want to check how many rows there are in which ? occurs, along with the number of occurrences.
So far I made this, but it shows the number of all rows, not only the ones in which ? occurs.
print(sum([True for idx, row in df.iterrows()
           if any(row.str.contains('[?]'))]))
You can use apply + str.contains, assuming all your columns are strings.
c = np.sum(df.apply(lambda x: x.str.contains(r'\?')).values)
If you need to select string columns only, use select_dtypes -
i = df.select_dtypes(exclude=['number']).apply(lambda x: x.str.contains(r'\?'))
c = np.sum(i.values)
Alternatively, to find the number of rows containing ? in them, use
c = df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
Demo -
df
A B
0 aaa ?xyz
1 bbb que!?
2 ? ddd
3 foo? fff
df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
4
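A sketch of both counts without apply, using stack to get one boolean per cell (assuming all-string columns, as in the demo above):

```python
import pandas as pd

df = pd.DataFrame({"A": ["aaa", "bbb", "?", "foo?"],
                   "B": ["?xyz", "que!?", "ddd", "fff"]})

# stack flattens the frame to one Series of cells, so .str works once;
# unstack restores the cell-level boolean mask as a DataFrame.
has_q = df.stack().str.contains(r"\?").unstack()

total_cells = has_q.to_numpy().sum()   # cells containing '?'
rows_with_q = has_q.any(axis=1).sum()  # rows containing '?' anywhere
print(total_cells, rows_with_q)  # 4 4
```

The two numbers coincide here only because every row happens to contain exactly one ?.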
I have a data frame with a string column. I need to create a new column with the 3rd element after col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error.
Actually I need to turn col1 into multiple columns after splitting by " ".
What is the correct way to do this ?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
col
0 Lets say 2000 is greater than 5
You can split and use the str accessor to get elements at different positions:
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
col third fifth last
0 Lets say 2000 is greater than 5 2000 greater 5
Another way is:
df["third"] = df['col1'].apply(lambda x: x.split()[2])
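Since the question also asks how to turn col1 into multiple columns, str.split with expand=True does that in one call; a sketch with the sample frame from above (the new columns get default integer names):

```python
import pandas as pd

df = pd.DataFrame({"col": ["Lets say 2000 is greater than 5"]})

# expand=True returns a DataFrame with one column per token
parts = df["col"].str.split(" ", expand=True)
print(parts[2].iloc[0])  # 2000
```

The result can be joined back with pd.concat([df, parts], axis=1), or the columns renamed to something meaningful.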
Is there a way to get a substring of a column based on another column?
import pandas as pd
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x
digit name
0 2 bernard
1 3 brenden
2 3 bern
What I would expect is something like:
for row in x.itertuples():
    print(row[2][:row[1]])
be
bre
ber
where the result is the substring of name based on digit.
I know that if I really want to, I can build a list with the itertuples loop, but that does not seem right, and I always try to find a vectorized method.
Appreciate any feedback.
Use apply with axis=1 for a row-wise operation, with a lambda so you can access each column for slicing:
In [68]:
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x.apply(lambda x: x['name'][:x['digit']], axis=1)
Out[68]:
0 be
1 bre
2 ber
dtype: object
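Since the asker wanted to avoid row-wise apply, one common alternative is a plain list comprehension over the two columns, which is usually faster than apply(axis=1) even though it is not truly vectorized; a sketch:

```python
import pandas as pd

x = pd.DataFrame({"name": ["bernard", "brenden", "bern"],
                  "digit": [2, 3, 3]})

# zip the two columns and slice each name by its own digit
x["short"] = [name[:d] for name, d in zip(x["name"], x["digit"])]
print(x["short"].tolist())  # ['be', 'bre', 'ber']
```

There is no built-in pandas string method that takes a per-row slice length, so some form of Python-level loop is unavoidable here; the comprehension just keeps its overhead low.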