extract semicolon-separated value from pandas df column [duplicate] - python

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I need to extract a specific value from pandas df column. The data looks like this:
row my_column
1 artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i
2 ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true
3 dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001
The data is not consistent, but it is separated by semicolons, and I need to extract role=x.
I split the data on semicolons and can loop through the values to fetch the roles, but I was wondering if there is a more elegant way to solve it.
Desired output:
row my_column
1 role=4
2 role=1
3 role=56
Thank you.

You can use str.extract and pass the required pattern within parentheses (a raw string avoids invalid-escape warnings for \d):
df['my_column'] = df['my_column'].str.extract(r'(role=\d+)')
row my_column
0 1 role=4
1 2 role=1
2 3 role=56
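If you only want the numeric part, move the capture group inside the pattern (a minimal sketch, assuming the same frame; the role column name is made up):
# expand=False returns a Series; the group captures only the digits after 'role='
df['role'] = df['my_column'].str.extract(r'role=(\d+)', expand=False)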

This should work:
def get_role(x):
    # split the field on semicolons and keep the first part that starts with 'role'
    parts = x.split(sep=';')
    return [p for p in parts if p[:4] == 'role'][0]

df['my_column'] = df['my_column'].map(get_role)
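A defensive variant (a minimal sketch; it assumes some rows might lack a role field, which the sample data does not show) uses next() with a default instead of indexing:
def get_role_safe(x):
    # return the first 'role=...' part, or None when a row has no role field
    return next((part for part in x.split(';') if part.startswith('role')), None)

df['my_column'] = df['my_column'].map(get_role_safe)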


How to convert all numeric values to floats when treated as string [duplicate]

This question already has answers here:
Change column type in pandas
(16 answers)
Closed 7 months ago.
I have a df like this:
label data start
37 1 Ses01M_impro04_F018 [145.2100-153.0500]: We're... 145.21000
38 2 Ses01M_impro04_M019 [148.3800-151.8400]: Well,... 148.38000
39 2 M: [BREATHING] BREATHING
40 1 Ses01M_impro04_M020 [159.7700-161.8600]: I'm n... 159.77000
I parsed out the start column to get the starting timestamp for each row using this code:
df['start'] = df.data.str.split().str[1].str[1:-2].str.split('-').str[0]
I want to convert df.start into floats because they are treated as strings right now. However, I can't simply call .astype(float) because of the actual string BREATHING in row 39.
I'd like to just drop the row containing alphabetic characters (row 39). I do not know how to do this because, at this point, all values in df.start are strings, so I can't filter with something like isnumeric(). How do I do this?
Pasting skeletal code; you can modify it and use it:
# '145.21000'.isnumeric() is False because of the dot, so strip one dot first
if a.replace('.', '', 1).isnumeric():
    newa = float(a)
else:
    newa = a
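A vectorized route (a sketch, assuming the column is named start as above): pd.to_numeric with errors='coerce' turns non-numeric strings such as BREATHING into NaN, and those rows can then be dropped.
import pandas as pd

df['start'] = pd.to_numeric(df['start'], errors='coerce')  # 'BREATHING' becomes NaN
df = df.dropna(subset=['start'])                           # drop the offending rows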

Removing special characters from column headers [duplicate]

This question already has answers here:
How to flatten a hierarchical index in columns
(19 answers)
Closed 1 year ago.
I used to_flat_index() to flatten the columns and ended up with column names like ('Method', 'sum'). I am trying to remove the special characters from these, but when I try to remove them, it changes all the column names to nan.
function attempted:
df_pred.columns = df_pred.columns.str.replace("[(,),']", '')
Expected outcome: MethodSum
It seems your columns are multi-indexed, because you used to_flat_index.
>>> df
bar baz foo qux
one two one two one two one two
0 0.713825 0.015553 0.036683 0.388443 0.729509 0.699883 0.125998 0.407517
1 0.820843 0.259039 0.217209 0.021479 0.845530 0.112166 0.219814 0.527205
2 0.734660 0.931206 0.651559 0.337565 0.422514 0.873403 0.979258 0.269594
3 0.314323 0.857317 0.222574 0.811631 0.313495 0.315072 0.354784 0.394564
4 0.672068 0.658103 0.402914 0.430545 0.879331 0.015605 0.086048 0.918678
Try:
>>> df.columns.to_flat_index().map(''.join)
Index(['barone', 'bartwo', 'bazone', 'baztwo',
'fooone', 'footwo', 'quxone', 'quxtwo'],
dtype='object')
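To assign the names back and match the expected MethodSum exactly (a sketch; it assumes each flattened name is a two-part tuple like ('Method', 'sum')), capitalize each part before joining:
# ('Method', 'sum') -> 'MethodSum'
df_pred.columns = [''.join(part.capitalize() for part in col)
                   for col in df_pred.columns.to_flat_index()]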

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of an entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How can I filter this dataframe based on the length of the lists in the subjects column?
So, for example, if I only want to keep rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Also, using loc does not help, that gives me the same error.
Thanks in advance!
Use the string accessor; .str.len() also works on list-valued columns:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
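An equivalent without the string accessor (a minimal sketch) maps len over the column:
df[df['subjects'].map(len) >= 2]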

pandas contains exact string from a list [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 2 years ago.
I have 2 dataframes df1 and df2.
I would like to get all rows in df1 that have an exact string match with column B of df2.
This is df1:
df1={"columnA":['apple,cherry','pineple,lemon','banana, pear','cherry, pear, lemon']}
df1=pd.DataFrame(df1)
This is df2:
df2={"columnB":['apple','cherry']}
df2=pd.DataFrame(df2)
The code below outputs an incorrect result:
df1[df1['columnA'].str.contains('|'.join(df2['columnB'].values))]
Pineapple is not supposed to appear, as it is not an exact match.
How can I get a result like this?
Without an actual reproducible example it's harder to help you, but I think this should work:
words = [rf'\b{string}\b' for string in df2.columnB]
df1[df1['columnA'].str.contains('|'.join(words))]
df1={"columnA":['apple,cherry','pineple,lemon','banana, pear','cherry, pear, lemon']}
df1=pd.DataFrame(df1)
df2={"columnB":['apple','cherry']}
df2=pd.DataFrame(df2)
A longer way of doing it, but correct and simple:
list1 = []
for i in range(0, len(df1)):
    for j in range(0, len(df2)):
        # keep row i as soon as any word from df2 occurs in it, then stop checking
        if df2["columnB"][j] in df1["columnA"][i]:
            list1.append(i)
            break
df = df1.loc[list1]
Answer
ColumnA
0 apple,cherry
3 cherry, pear, lemon
You were very close, but you will need to apply the word-boundary operator of regex (in raw strings, and with a non-capturing group to avoid the match-groups warning):
df1[df1['columnA'].str.contains(r"\b(?:" + '|'.join(df2['columnB'].values) + r")\b")]
This will match complete words only.
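If any of the search words could contain regex metacharacters (an assumption; the sample words are plain), escape them before joining (a minimal sketch):
import re

pattern = r'\b(?:' + '|'.join(re.escape(w) for w in df2['columnB']) + r')\b'
df1[df1['columnA'].str.contains(pattern)]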

Missing first row while reading from file - Python Pandas [duplicate]

This question already has answers here:
Prevent pandas read_csv treating first row as header of column names
(4 answers)
Closed 3 years ago.
I have a file which has coordinates like
1 1
1 2
1 3
1 4
1 5
and so on
There are no zeros in them. I tried using a comma and a tab as the delimiter and am still stuck with the same problem.
When I printed the output to the screen I saw something very weird: it looks like it is missing the very first line.
The output after running pa.read_csv('co-or.txt', sep='\t') is as follows:
1 1
0 1 2
1 1 3
2 1 4
3 1 5
and so on..
I am not sure if I am missing any arguments here.
Also, when I tried to convert it to a NumPy array using np.array, it is again missing the first line and hence the first element [1 1].
df = pd.read_csv('data.csv', header=None)
You need to specify header=None, otherwise pandas takes the first row as the column headers.
If you want to give the columns meaningful names you can use the names argument, as such:
df = pd.read_csv('data.csv', header=None, names=['foo','bar'])
Spend some time with the pandas documentation as well to get familiar with its API; this one is for read_csv.
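Applied to the file from the question (a sketch; co-or.txt and the tab separator come from the question, while the names x and y are made up for illustration):
import pandas as pd

df = pd.read_csv('co-or.txt', sep='\t', header=None, names=['x', 'y'])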
You can try this:
# a context manager closes the file automatically
with open('file.dat', 'r') as file:
    lines = file.readlines()
and it does work.
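To turn those raw lines into a frame (a sketch; the column names x and y are made up, and it assumes whitespace-separated pairs as in the question):
import pandas as pd

with open('file.dat') as fh:
    rows = [line.split() for line in fh]           # e.g. ['1', '1']
df = pd.DataFrame(rows, columns=['x', 'y']).astype(int)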
