I have a df with some trigrams (and some higher-order ngrams), and I would like to check whether each sentence starts or ends with any word from a specific list, and remove those rows from my df. For example:
import pandas as pd
df = pd.DataFrame({'Trigrams+': ['because of tuna', 'to your family', 'pay to you', 'give you in','happy birthday to you'], 'Count': [10,9,8,7,5]})
list_remove = ['of','in','to', 'a']
print(df)
Trigrams+ Count
0 because of tuna 10
1 to your family 9
2 pay to you 8
3 give you in 7
4 happy birthday to you 5
I tried using strip, but in the example above the first row would return because of tun.
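(For reference, str.strip removes individual characters from the ends of a string, not whole words, which is why the trailing a of "tuna" disappears:)

```python
import pandas as pd

s = pd.Series(['because of tuna'])
# strip('a') removes leading/trailing 'a' *characters*, so the
# final 'a' of "tuna" is eaten even though 'a' was meant as a word.
print(s.str.strip('a').tolist())  # ['because of tun']
```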
The output should be like this:
Trigrams+ Count
0 because of tuna 10
1 pay to you 8
2 happy birthday to you 5
Can someone help me with that? Thanks in advance!
Try:
list_remove = ["of", "in", "to", "a"]
tmp = df["Trigrams+"].str.split()
df = df[~(tmp.str[0].isin(list_remove) | tmp.str[-1].isin(list_remove))]
print(df)
Prints:
Trigrams+ Count
0 because of tuna 10
2 pay to you 8
4 happy birthday to you 5
You can try something like this:
import numpy as np
def func(x):
    y = x.split()[0]
    z = x.split()[-1]
    if (y in list_remove) or (z in list_remove):
        return np.nan
    return x

df['Trigrams+'] = df['Trigrams+'].apply(func)
df = df.dropna().reset_index(drop=True)
Related
Let's say I have a df like this:
string some_col
0 But were so TESSA tell me a little bit more t ... 10
1 15
2 14
3 Some other text xxxxxxxxxx 20
How can I split the string column so that the long string is exploded into roughly equal-length chunks across the empty cells? It should look like this after the split:
string some_col
0 But were so TESSA tell me . 10
1 little bit more t seems like 15
2 you pretty upset 14
Reproducible example:
import pandas as pd
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
print(df)
I've no idea how to even get started. I'm looking for execution steps so that I can implement it on my own; any reference would be great!
You need to create groups with a non-empty row and all its consecutive empty rows (the group length gives the number of chunks), then use np.array_split to create n lists of words:
import numpy as np
# first row --v group length --v
wrap = lambda x: [' '.join(l) for l in np.array_split(x.iloc[0].split(), len(x))]
df['string2'] = (df.groupby(df['string'].str.len().ne(0).cumsum())['string']
                   .apply(wrap).explode().to_numpy())
Output:
string some_col string2
0 But were so TESSA tell me a you pretty upset. 10 But were so TESSA
1 15 tell me a
2 14 you pretty upset.
3 Some other text xxxxxxxxxx 20 Some other text xxxxxxxxxx
This works in your case:
import pandas as pd
import numpy as np
from math import ceil
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14],
['Some other long string that you need..', 10], ['', 15]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
df['string'] = np.where(df['string'] == '', None, df['string'])
df.ffill(inplace=True)
df['group_id'] = df.groupby('string').cumcount() + 1
df['max_group_id'] = df.groupby('string').transform('count')['group_id']
df['string'] = df['string'].str.split(' ')
df['string'] = df.apply(func=lambda r: r['string'][int(ceil(len(r['string'])/r['max_group_id'])*(r['group_id']-1)):
int(ceil(len(r['string'])/r['max_group_id'])*r['group_id'])], axis=1)
df.drop(columns=['group_id', 'max_group_id'], inplace=True)
print(df)
Result:
string some_col
0 [But, were, so, TESSA] 10
1 [tell, me, a, you] 15
2 [pretty, upset.] 14
3 [Some, other, long, string] 10
4 [that, you, need..] 15
You can customize the number of rows you want with this code:
import pandas as pd
import random
df = pd.read_csv('text.csv')
string = df.at[0,'string']
# the number of rows you want
num_of_rows = 4
endLineLimits = random.sample(range(1, string.count(' ')), num_of_rows - 1)
count = 1
for i in range(len(string)):
    if string[i] == ' ':
        if count in endLineLimits:
            string = string[:i] + ';' + string[i+1:]
        count += 1
newStrings = string.split(';')
for i in range(len(df)):
    df.at[i, 'string'] = newStrings[i]
print(df)
Example result:
string some_col
0 But were so TESSA tell 10
1 me a little bit more t 15
2 seems like you pretty 14
3 upset 20
I am trying to calculate the difference in the "time" column between each pair of elements having the same value in the "class" column.
This is an example of an input:
class name time
0 A Bob 2022-09-05 07:22:15
1 A Sam 2022-09-04 17:18:29
2 B Bob 2022-09-04 03:29:06
3 B Sue 2022-09-04 01:28:34
4 A Carol 2022-09-04 10:40:23
And this is an output:
class name1 name2 timeDiff
0 A Bob Carol 0 days 20:41:52
1 A Bob Sam 0 days 14:03:46
2 A Carol Sam 0 days 06:38:06
3 B Bob Sue 0 days 02:00:32
I wrote this code to solve this problem:
from itertools import combinations
import numpy as np
import pandas as pd

df2 = pd.DataFrame(columns=['class', 'name1', 'name2', 'timeDiff'])
for c in df['class'].unique():
    df_class = df[df['class'] == c]
    groups = df_class.groupby(['name'])['time']
    if len(df_class) > 1:
        out = (pd
               .concat({f'{k1} {k2}': pd.Series(data=np.abs(np.diff([g2.values[0], g1.values[0]])).astype('timedelta64[s]'), index=[f'{k1} {k2}'], name='timeDiff')
                        for (k1, g1), (k2, g2) in combinations(groups, 2)},
                       names=['name'])
               .reset_index()
               )
        new = out["name"].str.split(" ", n=-1, expand=True)
        out["name1"] = new[0].astype(str)
        out["name2"] = new[1].astype(str)
        out["class"] = c
        del out['level_1'], out['name']
        df2 = pd.concat([df2, out], ignore_index=True)
I couldn't come up with a solution that avoids looping over all the class values. However, the loop is very time-consuming if the input table is large. Does anyone have a solution without a loop?
The whole thing is a self cross-join plus a time difference:
import pandas as pd
df = pd.DataFrame({
'class': ['A', 'A', 'B', 'B', 'A'],
'name': ['Bob', 'Sam', 'Bob', 'Sue', 'Carol'],
'time': [
pd.Timestamp('2022-09-05 07:22:15'),
pd.Timestamp('2022-09-04 17:18:29'),
pd.Timestamp('2022-09-04 03:29:06'),
pd.Timestamp('2022-09-04 01:28:34'),
pd.Timestamp('2022-09-04 10:40:23'),
]
})
rs = list()
for n, df_g in df.groupby('class'):
    t_df = df_g.merge(
        df_g, how='cross',
        suffixes=('_1', '_2')
    )
    t_df = t_df[t_df['name_1'] != t_df['name_2']]
    t_df = t_df.drop(['class_2'], axis=1)\
        .rename({'class_1': 'class'}, axis=1).reset_index(drop=True)
    t_df['timeDiff'] = abs(t_df['time_1'] - t_df['time_2'])\
        .astype('timedelta64[ns]')
    t_df = t_df.drop(['time_1', 'time_2'], axis=1)
    rs.append(t_df)
rs_df = pd.concat(rs).reset_index(drop=True)
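The per-class loop can also be avoided entirely by merging the frame with itself on the class column, which pairs every row only with rows of the same class. A sketch of that idea (not from the original answers):

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'A'],
    'name': ['Bob', 'Sam', 'Bob', 'Sue', 'Carol'],
    'time': pd.to_datetime([
        '2022-09-05 07:22:15', '2022-09-04 17:18:29',
        '2022-09-04 03:29:06', '2022-09-04 01:28:34',
        '2022-09-04 10:40:23',
    ]),
})

# Merging on 'class' yields all same-class row pairs in one shot;
# keeping name_1 < name_2 removes self-pairs and mirrored duplicates.
out = df.merge(df, on='class', suffixes=('_1', '_2'))
out = out[out['name_1'] < out['name_2']]
out['timeDiff'] = (out['time_1'] - out['time_2']).abs()
out = out[['class', 'name_1', 'name_2', 'timeDiff']].reset_index(drop=True)
print(out)
```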
Check the code below — no outer join, using aggregation and itertools:
from itertools import combinations
# Function to create list of name pairs
def agg_to_list(value):
    return list(list(i) for i in combinations(list(value), 2))

# Function to build pairs of times & calculate the differences between them
def agg_to_list_time(value):
    return [t[0] - t[1] for t in list(combinations(list(value), 2))]
# Apply aggregate functions
updated_df = df.groupby(['class']).agg({'name':agg_to_list,'time':agg_to_list_time})
# Explode DataFrame & rename column
updated_df = updated_df.explode(['name','time']).rename(columns={'time':'timediff'})
# Unpack name column in two new columns
updated_df[['name1','name2']] = pd.DataFrame(updated_df.name.tolist(), index=updated_df.index)
# Final DataFrame
print(updated_df.reset_index()[['class','name1','name2','timediff']])
Output:
class name1 name2 timediff
0 A Bob Sam 0 days 14:03:46
1 A Bob Carol 0 days 20:41:52
2 A Sam Carol 0 days 06:38:06
3 B Bob Sue 0 days 02:00:32
I have the following dataframe, and this is my code:
movies_taxes['Total Taxes'] = movies_taxes.apply(lambda x:(0.2)* x['US Gross'] + (0.18) * x['Worldwide Gross'], axis=1)
movies_taxes
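Since the original dataframe was not included, here is the same computation on made-up figures (the column names 'US Gross' and 'Worldwide Gross' come from the question's code). Note that the row-wise apply can be replaced by plain column arithmetic, which is much faster:

```python
import pandas as pd

# Hypothetical data standing in for the question's dataframe.
movies_taxes = pd.DataFrame({
    'US Gross': [100.0, 250.0],
    'Worldwide Gross': [300.0, 500.0],
})
# Vectorised arithmetic on whole columns replaces the row-wise apply.
movies_taxes['Total Taxes'] = (0.2 * movies_taxes['US Gross']
                               + 0.18 * movies_taxes['Worldwide Gross'])
print(movies_taxes)
```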
Simple example:
import pandas as pd
df = pd.DataFrame({'player': ['C','B','A'], 'data': [1,2,3]})
df = df.sort_values(by ='player')
Output:
From:
player data
0 C 1
1 B 2
2 A 3
To:
player data
2 A 3
1 B 2
0 C 1
Another example:
df = pd.DataFrame({
    'student': ['monica', 'nathalia', 'anastasia', 'marina', 'ema'],
    'grade': ['excellent', 'excellent', 'good', 'very good', 'good'],
})
print (df)
student grade
0 monica excellent
1 nathalia excellent
2 anastasia good
3 marina very good
4 ema good
Pre pandas 0.17:
Sort by ascending student name
df.sort('student')
reverse ascending
df.sort('student', ascending=False)
Pandas 0.17+ (as mentioned in the other answers):
ascending
df.sort_values('student')
reverse ascending
df.sort_values('student', ascending=False)
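With the modern API, the grade example above can be sorted end to end like this:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['monica', 'nathalia', 'anastasia', 'marina', 'ema'],
    'grade': ['excellent', 'excellent', 'good', 'very good', 'good'],
})
# Reverse-alphabetical order by student name.
out = df.sort_values('student', ascending=False)
print(out)
```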
This ought to do it:
>>> import pandas as pd
>>> s = pd.Series(['banana', 'apple', 'friends', '3 dog and cat', '10 old man'])
>>> import numpy as np
# We want to know which rows start with a number as well as those that don't
>>> mask = np.array([not any(x.startswith(str(n)) for n in range(10)) for x in s])
>>> s[mask]
0 banana
1 apple
2 friends
dtype: object
# Stack the sorted, non-starting-with-a-number array and the sorted, starting-with-a-number array
>>> pd.concat((s[mask].sort_values(), s[~mask].sort_values(ascending=False)))
1 apple
0 banana
2 friends
3 3 dog and cat
4 10 old man
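Assuming no empty strings, the same mask can also be built with pandas' vectorised string accessor instead of a Python list comprehension — a possible simplification, not part of the original answer:

```python
import pandas as pd

s = pd.Series(['banana', 'apple', 'friends', '3 dog and cat', '10 old man'])
# True where the first character is NOT a digit.
mask = ~s.str[0].str.isdigit()
# Non-numeric rows ascending, then numeric-prefixed rows descending.
out = pd.concat((s[mask].sort_values(), s[~mask].sort_values(ascending=False)))
print(out.tolist())
```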
Given the following dataframe:
df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
I want to calculate the length of each string and add a cumsum column.
I am trying to do this with df.str.len("col1") but it throws an error.
Use str.len()
Ex:
import pandas as pd
df = pd.DataFrame({"col1": ["kuku", "pu", "d", "fgf"]})
df["New"] = df["col1"].str.len()
print(df)
print(df["New"].cumsum()) #cumulative sum
Output:
col1 New
0 kuku 4
1 pu 2
2 d 1
3 fgf 3
0 4
1 6
2 7
3 10
Name: New, dtype: int64
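Since the question asks to add the cumulative sum as a column, it can be assigned directly rather than just printed — a small extension of the snippet above:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["kuku", "pu", "d", "fgf"]})
df["New"] = df["col1"].str.len()     # per-row string length
df["CumSum"] = df["New"].cumsum()    # running total of the lengths
print(df)
```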
The dataframe initialization code is wrong. Try this.
>>> df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
>>> df
col1
0 kuku
1 pu
2 d
3 fgf
Alternatively, you can use map as well.
>>> df.col1.map(lambda x: len(x))
0 4
1 2
2 1
3 3
To calculate length.
>>> df['len'] = df.col1.str.len()
>>> df
col1 len
0 kuku 4
1 pu 2
2 d 1
3 fgf 3
Or
import pandas as pd
df = pd.DataFrame({ "col1" : ["kuku", "pu", "d", "fgf"]})
df['new'] = df.col1.apply(lambda x: len(x))
Your col1 argument is an unknown argument to pd.DataFrame()...
Use data as the argument name instead... Then add your new column with the length
data = {'col1': ["kuku", "pu", "d", "fgf"]}
df = pd.DataFrame(data=data)
df["col1 lengths"] = df["col1"].str.len()
print(df)
Here is another alternative that I think solves my issue:
df = pd.DataFrame({"col1": ['dilly macaroni recipe salad', 'gazpacho', 'bake crunchy onion potato', 'cool creamy easy pie watermelon', 'beef easy skillet tropical', 'chicken grilled tea thigh', 'cake dump rhubarb strawberry', 'parfaits yogurt', 'bread nut zucchini', 'la salad salmon']})
df["title_len"] = df["col1"].str.len()
df["cum_len"] = df["title_len"].cumsum()
I'm sure this is on SO somewhere, but I can't seem to find it. I'm trying to remove or select designated columns in a pandas df, but I want to keep certain values or strings from those deleted columns.
For the df below I want to keep 'Big' and 'Cat' in columns B and C but delete everything else.
import pandas as pd
d = ({
'A' : ['A','Keep','A','Value'],
'B' : ['Big','X','Big','Y'],
'C' : ['Cat','X','Cat','Y'],
})
df = pd.DataFrame(data=d)
If I do either the following it only selects that row.
Big = df[df['B'] == 'Big']
Cat = df[df['C'] == 'Cat']
My intended output is:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
I need something like x = df[df['B','C'] != 'Big','Cat']
Seems like you want to keep only some values and have empty strings on the others.
Use np.where
import numpy as np

keeps = ['Big', 'Cat']
df['B'] = np.where(df.B.isin(keeps), df.B, '')
df['C'] = np.where(df.C.isin(keeps), df.C, '')
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Another solution using df.where
cols = ['B', 'C']
df[cols] = df[cols].where(df[cols].isin(keeps)).fillna('')
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
IIUC
Update
df[['B','C']]=df[['B','C']][df[['B','C']].isin(['Big','Cat'])].fillna('')
df
Out[30]:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
You can filter on column combinations via NumPy and np.ndarray.all:
mask = (df[['B', 'C']].values != ['Big', 'Cat']).all(1)
df.loc[mask, ['B', 'C']] = ''
print(df)
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Or this:
df[['B','C']]=df[['B','C']].apply(lambda row: row if row.tolist()==['Big','Cat'] else ['',''],axis=1)
print(df)
Output:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Perhaps a concise version:
df.loc[df['B'] != 'Big', 'B'] = ''
df.loc[df['C'] != 'Cat', 'C'] = ''
print(df)
Output:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value