So, I have a dataframe of this type:
Doc String
A   abc
A   def
A   ghi
B   jkl
B   mnop
B   qrst
B   uv
What I'm trying to do is merge/collapse rows according to two conditions:
they must be from the same document
they should be merged together up to a max length
So that, for example, with max_len == 6 I will get:
Doc String
A   abcdef
A   ghi
B   jkl
B   mnop
B   qrstuv
The output doesn't have to be that strict. To explain the why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe with each "new sentence" being of maximal length.
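For reference, the sample frame can be built like this (a minimal sketch; only pandas is assumed):

import pandas as pd

df = pd.DataFrame({'Doc': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'String': ['abc', 'def', 'ghi', 'jkl', 'mnop', 'qrst', 'uv']})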
I couldn't find a pure Pandas solution (i.e. one that does the grouping only with Pandas methods). You could try the following though:
def group(col, max_len=6):
    # Assign a group number to each string length so that the lengths
    # within one group sum to at most max_len.
    groups = []
    group = acc = 0
    for length in col.values:
        acc += length
        if max_len < acc:  # adding this string would exceed max_len: start a new group
            group, acc = group + 1, length
        groups.append(group)
    return groups

groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)
The group function takes the column of string lengths for one Doc group and builds group labels that respect the max_len condition. Based on that, a second groupby over Doc and groups then concatenates the strings.
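To illustrate, for doc A the lengths are [3, 3, 3]: the first two fit into max_len == 6 but the third would exceed it, so a new label starts there (a quick check, reusing the group function above):

import pandas as pd

lengths = pd.Series([3, 3, 3])  # len('abc'), len('def'), len('ghi')
print(group(lengths))           # [0, 0, 1] -> 'abcdef' and 'ghi'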
Result for the sample:
Doc String
0 A abcdef
1 A ghi
2 B jkl
3 B mnop
4 B qrstuv
I have not tried to run this code so there might be bugs, but essentially:
max_length = 6
uniques = list(set(df['Doc'].values))
new_df = pd.DataFrame(index=uniques, columns=df.columns)
for doc in uniques:
    x_df = df.loc[df['Doc'] == doc, 'String']         # Series of strings for this doc
    concatenated = ''.join(x_df.values)[:max_length]  # join them and truncate to max_length
    new_df.loc[doc, 'String'] = concatenated
Related
I want to strip words, specified in a list, from strings of a pandas column, and build another column with them.
I have this example, inspired by the question python pandas if column string contains word flag:
import pandas as pd
import numpy as np

listing = ['test', 'big']
df = pd.DataFrame({'Title': ['small test', 'huge Test', 'big', 'nothing', np.nan, 'a', 'b']})
df['Test_Flag'] = np.where(df['Title'].str.contains('|'.join(listing), case=False,
                                                    na=False), 'T', '')
print (df)
Title Test_Flag
0 small test T
1 huge Test T
2 big T
3 nothing
4 NaN
5 a
6 b
But what if, instead of "T", I want to put the actual word from the list that was found?
So, having a result:
Title Test_Flag
0 small test test
1 huge Test test
2 big big
3 nothing
4 NaN
5 a
6 b
Using the .apply method with a custom function should give you what you are looking for:
import pandas as pd
import numpy as np

# Define the listing list with the words you want to extract
listing = ['test', 'big']

# Define the DataFrame
df = pd.DataFrame({'Title': ['small test', 'huge Test', 'big', 'nothing', np.nan, 'a', 'b']})

# Define the function which takes a string and a list of words to extract as inputs
def listing_splitter(text, listing):
    # Try/except to handle np.nan in the input
    try:
        # Extract the list of flags
        flags = [l for l in listing if l in text.lower()]
        # If any flags were extracted then return the list
        if flags:
            return flags
        # Otherwise return np.nan
        else:
            return np.nan
    except AttributeError:
        return np.nan

# Apply the function to the column
df['Test_Flag'] = df['Title'].apply(lambda x: listing_splitter(x, listing))
df
Output:
Title Test_Flag
0 small test ['test']
1 huge Test ['test']
2 big ['big']
3 nothing NaN
4 NaN NaN
5 a NaN
6 b NaN
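A pandas-only alternative (a sketch, not part of the original answer): str.findall with a case-insensitive pattern returns the matched words directly. Note it returns the text as matched ('Test' rather than 'test') and an empty list, not NaN, when nothing matches:

import re

df['Test_Flag'] = df['Title'].str.findall('|'.join(listing), flags=re.IGNORECASE)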
I have a dataframe
df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
I want to:
Split col2 on the ' ' where col1==1
Split on the '-' where col1==2
Append this data to 3 new columns: (col20, col21, col22)
Ideally the code would look like this:
subdf = df.loc[df['col1']==1]
# list of columns to use
col_list = ['col20', 'col21', 'col22']
# append to dataframe new columns from split function
subdf[col_list] = subdf.col2.str.split(' ', 2, expand=True)
However, this hasn't worked.
I have tried using merge and join, however:
join doesn't work if the columns are already populated
merge doesn't work if they aren't.
I have also tried:
#subset dataframes
subdf=df.loc[df['col1']==1]
subdf2=df.loc[df['col1']==2]
#trying the join method, only works if columns aren't already present
subdf.join(subdf.col2.str.split(' ', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
#merge doesn't work if columns aren't present
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
subdf2
The error message when I run it:
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'})
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
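(For reference, the immediate MergeError can be avoided by merging on the index instead, since str.split keeps the original index; a sketch, not from the answers below:)

subdf2 = subdf2.merge(subdf2.col2.str.split('-', n=2, expand=True)
                            .rename(columns={0: 'col20', 1: 'col21', 2: 'col22'}),
                      left_index=True, right_index=True)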
EDIT: giving information after Mark's comment on regex
My original col1 was actually the regex combination I had used to extract col2 from some strings.
# the combinations I used to extract col2
combinations= ['(\d+)[-](\d+)[-](\d+)[-](\d+)', '(\d+)[-](\d+)[-](\d+)'... ]
Here is the original dataframe:
col1 col2
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31
I then created a dictionary that connected every combination to what the split values of col2 represented:
filtermap={'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample', '(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height' }
With this filtermap I wanted to:
subset the dataframe based on the regex combinations
use split on col2 to find the values corresponding to the combination, using the filtermap (thickness, temperature, ...)
add these values to new columns on the dataframe:
col1 col2 thickness temperature width length sample
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10 350 300 50 10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31 150 180 G31
Since you mentioned regex, maybe you know of a way to do this directly?
EDIT 2: input-output
in the input there are strings like so:
'this is the first example string 350-300-50-10 ',
'this is the second example string 150-180-G31'
The formats are:
number-number-number-number (350-300-50-10) has this ordered information in it: thickness(350)-temperature(300)-width(50)-length(10)
number-number-letternumber (150-180-G31) has this ordered information in it: thickness-temperature-sample
desired output:
col2, thickness, temperature, width, length, sample
350-300-50-10 350 300 50 10 None
150-180-G31 150 180 None None G31
I used e.g.:
re.search('(\d+)[-](\d+)[-](\d+)[-](\d+)')
to find the col2 in the strings.
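For the regex variant in the edit, one possible sketch (my assumption, not part of the answers below) is str.extract with named groups, one pattern per format, merged with combine_first:

import pandas as pd

df = pd.DataFrame({'col2': ['350-300-50-10', '150-180-G31']})

# One named-group pattern per format; rows that don't match a pattern yield NaN.
four = df['col2'].str.extract(r'^(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<width>\d+)-(?P<length>\d+)$')
three = df['col2'].str.extract(r'^(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<sample>\w+)$')
res = df.join(four.combine_first(three))  # columns: thickness, temperature, width, length, sample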
You can use np.where to simplify this problem.
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, 1, 2],
                   'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})

temp = np.where(df['col1'] == 1,            # a boolean array indicating where the values are equal to 1
                df['col2'].str.split(' '),  # use the output of this where True
                df['col2'].str.split('-'))  # else use this

temp_df = pd.DataFrame(temp.tolist())  # create a new dataframe with the columns we need
#Output:
0 1 2
0 aa bb cc
1 ee ff gg
2 hh ii kk
3 ll mm nn
Now just assign the result back to the original df. You can use a concat or join, but a simple assignment suffices as well.
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df
print(df)
col1 col2 col2_0 col2_1 col2_2
0 1 aa bb cc aa bb cc
1 2 ee-ff-gg ee ff gg
2 1 hh ii kk hh ii kk
3 2 ll-mm-nn ll mm nn
EDIT: To address more than two conditional splits
np.where is only designed for a binary selection. If you need more than two conditions, you can opt for a "custom" approach that works with as many splits as you like.
splits = [ ' ', '-', '---']
all_splits = pd.DataFrame({s:df['col2'].str.split(s).values for s in splits})
#Output:
- ---
0 [aa, bb, cc] [aa bb cc] [aa bb cc]
1 [ee-ff-gg] [ee, ff, gg] [ee-ff-gg]
2 [hh, ii, kk] [hh ii kk] [hh ii kk]
3 [ll-mm-nn] [ll, mm, nn] [ll-mm-nn]
First we split df['col2'] on all the splits, without expanding. Now it's just a question of selecting the correct list based on the value of df['col1'].
We can use numpy's advanced indexing for this.
temp = all_splits.values[np.arange(len(df)), df['col1']-1]
After this point, the steps are the same as above, starting with creating temp_df, as sketched below.
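For completeness, a minimal sketch of those remaining steps (the same as in the binary case above):

temp_df = pd.DataFrame(temp.tolist())                 # expand the chosen lists into columns
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df  # assign back to the original df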
You are pretty close. To generate a column based on some condition, where is often handy; see the code below.
col2_exp1 = df.col2.str.split(' ', expand=True)
col2_exp2 = df.col2.str.split('-', expand=True)
col2_combine = (col2_exp1.where(df.col1.eq(1), col2_exp2)
                         .rename(columns=lambda x: f'col2{x}'))
Finally,
df.join(col2_combine)
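This works because col2_combine keeps the original index, so join aligns the new columns row by row. The result should look like this (not re-run here):

   col1      col2 col20 col21 col22
0     1  aa bb cc    aa    bb    cc
1     2  ee-ff-gg    ee    ff    gg
2     1  hh ii kk    hh    ii    kk
3     2  ll-mm-nn    ll    mm    nn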
Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied via df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example:
A = pd.DataFrame({'X': ['a', 'b', 'c', 'a', 'c'],
                  'Y': ['at', 'bt', 'ct', 'at', 'ct'],
                  'Z': ['q', 'q', 'r', 'r', 's']})
print (A)

def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)

A.groupby('X').apply(myfunc)
I would like to expand this output into a new column Result such that, wherever there is 'a' in column X, Result will be True.
You can map the groupby result back to the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
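An equivalent variant (a sketch, not from the original answer) joins the per-group result back with merge:

res = A.groupby('X').apply(myfunc).rename('Result').reset_index()
A = A.merge(res, on='X', how='left')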
My solution, which uses a loop, may not be the best one, but I think it's pretty good.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp, then add the result column (c in my example) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})
gp = df.groupby('a')    # group
s = gp.apply(sum)['a']  # apply a func

adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
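For a simple reduction like this sum, transform broadcasts the per-group result back to the original shape in one step and keeps the original row order (a sketch):

df['c'] = df.groupby('a')['a'].transform('sum')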
I am trying to figure out how to solve the following problem:
I have a pandas dataframe that contains some strings delimited with ','. My goal is to find these and replace them with new lines, so that there are no more delimiters within the dataframe. For example, a cell contains 'hi,there' and I would like it to become 'hi' and 'there', so there will be two lines instead of one at the end.
This should be applied until there are no delimiters within the original dataframe, so if there are two delimited words ('hi,there' and 'whats,up,there') in one line in two different columns, it becomes 6 lines instead of the original one (a Cartesian product). The same should be applied to all lines within the dataframe.
Here below is code demonstrating the original dataframe (a) and the result I would like to end with:
a = pd.DataFrame([['Hi,there', 'fv', 'whats,up,there'],['dfasd', 'vfgfh', 'kutfddx'],['fdfa', 'uyg', 'iutyfrd']], columns = ['a', 'b', 'c'])
Desired output:
       a      b        c
0     Hi     fv    whats
1     Hi     fv       up
2     Hi     fv    there
3  there     fv    whats
4  there     fv       up
5  there     fv    there
6  dfasd  vfgfh  kutfddx
7   fdfa    uyg  iutyfrd
So far I have managed to copy each line as many times as I need for this purpose, but I cannot figure out how to replace the delimited words with what I want:
ndf = pd.DataFrame([])
for i in a.values:
    n = 1
    for j in i:
        if ',' in j:
            n = n * len(j.split(','))
    ndf = ndf.append([i] * n, ignore_index=False)
This duplicates each row the right number of times, but the cells still contain the delimiters.
Any idea how to proceed? I can only use pandas and numpy for this, but I am convinced they should suffice.
First I split the words on commas, then use the stack() function:
a_list = a.apply(lambda x: x.str.split(','))
for i in a_list:
    tmp = (pd.DataFrame.from_records(a_list[i].tolist())
             .stack()
             .reset_index(level=1, drop=True)
             .rename('new_{}'.format(i)))
    a = a.drop(i, axis=1).join(tmp)
a = a.reset_index(drop=True)
Result:
>>> a
new_a new_c new_b
0 Hi whats fv
1 Hi up fv
2 Hi there fv
3 there whats fv
4 there up fv
5 there there fv
6 dfasd kutfddx vfgfh
7 fdfa iutyfrd uyg
Update
To handle missing values (np.nan and None), I first convert them to a placeholder string, do the same as for normal data, and then replace the placeholder back with np.nan.
Let's insert some missing values
import numpy as np
a['a'].loc[0] = np.nan
a['b'].loc[1] = None
# a b c
# 0 NaN fv whats,up,there
# 1 dfasd None kutfddx
# 2 fdfa uyg iutyfrd
a.fillna('NaN', inplace=True) # some string
#
# insert the code above (with for loop)
#
a.replace('NaN', np.nan, inplace=True)
# new_a new_b new_c
# 0 NaN fv whats
# 1 NaN fv up
# 2 NaN fv there
# 3 dfasd NaN kutfddx
# 4 fdfa uyg iutyfrd
IIUC, you can agg with itertools.product
import itertools

# df here is the question's frame a
(df.agg(lambda r: pd.Series(list(itertools.product(r.a.split(','),
                                                   r.b.split(','),
                                                   r.c.split(',')))), axis=1)
   .stack()
   .apply(pd.Series)
   .reset_index(drop=True))
0 1 2
0 Hi fv whats
1 Hi fv up
2 Hi fv there
3 there fv whats
4 there fv up
5 there fv there
6 dfasd vfgfh kutfddx
7 fdfa uyg iutyfrd
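On pandas 0.25+ (an assumption; DataFrame.explode was added there), successive explode calls give the same Cartesian product without itertools (a sketch):

out = a.apply(lambda c: c.str.split(','))  # split every column on commas
for col in out.columns:
    out = out.explode(col)                 # each explode multiplies out one column
out = out.reset_index(drop=True)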
I have a pandas data frame with two columns, where each row holds a comma-separated list of words. How can I check whether there is a word match between these two columns on each row? (The flag column is the desired output.)
A             B            flag
hello,hi,bye  bye, also    1
but, as well  see, pandas  0
I have tried
df['A'].str.contains(df['B'])
but I got this error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You can split each value into separate words with split, build sets, and check the intersection with &; then convert the values to boolean (empty sets become False) and finally to int (False becomes 0 and True becomes 1).
zipped = zip(df['A'], df['B'])
df['flag'] = [int(bool(set(a.split(',')) & set(b.split(',')))) for a, b in zipped]
print (df)
A B flag
0 hello,hi,bye bye,also 1
1 but,as well see,pandas 0
Similar solution (note that zip returns an iterator, so it has to be re-created):
zipped = zip(df['A'], df['B'])
df['flag'] = np.array([set(a.split(',')) & set(b.split(',')) for a, b in zipped]).astype(bool).astype(int)
print (df)
A B flag
0 hello,hi,bye bye, also 1
1 but,as well see, pandas 0
EDIT: There may be some whitespace around the commas, so add map with str.strip and also remove empty strings with filter:
df = pd.DataFrame({'A': ['hello,hi,bye', 'but,,,as well'],
                   'B': ['bye ,,, also', 'see,,,pandas']})
print (df)
A B
0 hello,hi,bye bye ,,, also
1 but,,,as well see,,,pandas
zipped = zip(df['A'], df['B'])
def setify(x):
    return set(map(str.strip, filter(None, x.split(','))))
df['flag'] = [int(bool(setify(a) & setify(b))) for a, b in zipped]
print (df)
A B flag
0 hello,hi,bye bye ,,, also 1
1 but,,,as well see,,,pandas 0