Conditionally populate new columns using split on another column - python

I have a dataframe
df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
I want to:
Split col2 on the ' ' where col1==1
Split on the '-' where col1==2
Append this data to 3 new columns: (col20, col21, col22)
Ideally the code would look like this:
subdf=df.loc[df['col1']==1]
#list of columns to use
col_list=['col20', 'col21', 'col22']
#append to dataframe new columns from split function
subdf[col_list]=subdf.col2.str.split(' ', 2, expand=True)
However, this hasn't worked.
I have tried using merge and join, however:
join doesn't work if the columns are already populated
merge doesn't work if they aren't.
I have also tried:
#subset dataframes
subdf=df.loc[df['col1']==1]
subdf2=df.loc[df['col1']==2]
#trying the join method, only works if columns aren't already present
subdf.join(subdf.col2.str.split(' ', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
#merge doesn't work if columns aren't present
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
subdf2
The error message when I run it:
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
EDIT: giving information after Mark's comment on regex
My original col1 was actually the regex combination I had used to extract col2 from some strings.
#the combination I used to extract the col2
combinations= ['(\d+)[-](\d+)[-](\d+)[-](\d+)', '(\d+)[-](\d+)[-](\d+)'... ]
here is the original dataframe
col1 col2
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31
I then created a dictionary that connected every combination to what the split values of col2 represented:
filtermap={'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample', '(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height' }
with this filter I wanted to:
Subset the dataframe based on regex combinations
use split on col2 to find the values corresponding to the combination using the filtermap (thickness temperature..)
add these values to the new columns on the dataframe
col1 col2 thickness temperature width length sample
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10 350 300 50 10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31 150 180 G31
Since you mentioned regex, maybe you know of a way to do this directly?
EDIT 2: input-output
in the input there are strings like so:
'this is the first example string 350-300-50-10 ',
'this is the second example string 150-180-G31'
formats that are:
number-number-number-number (350-300-50-10) has this ordered information in it: thickness(350)-temperature(300)-width(50)-length(10)
number-number-letternumber (150-180-G31) has this ordered information in it: thickness-temperature-sample
desired output:
col2, thickness, temperature, width, length, sample
350-300-50-10 350 300 50 10 None
150-180-G31 150 180 None None G31
I used, e.g.:
re.search('(\d+)[-](\d+)[-](\d+)[-](\d+)', string)
to find col2 in the strings.
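For reference, here is a rough sketch of the extraction-plus-mapping step I describe above (the loop, the pattern ordering, and the variable names are only illustrative):
import re
import pandas as pd

strings = ['this is the first example string 350-300-50-10 ',
           'this is the second example string 150-180-G31']

# check the more specific (4-part) pattern before the 3-part one,
# otherwise the 3-part pattern would also match inside '350-300-50-10'
filtermap = {r'(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height',
             r'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample'}

rows = []
for s in strings:
    for pattern, names in filtermap.items():
        match = re.search(pattern, s)
        if match:
            row = {'col2': match.group(0)}
            # split the matched value and label the pieces per the filtermap
            row.update(dict(zip(names.split(), match.group(0).split('-'))))
            rows.append(row)
            break

out = pd.DataFrame(rows)  # pieces absent for a format (e.g. sample vs width/height) show up as NaN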

You can use np.where to simplify this problem.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 1, 2],
                   'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
temp = np.where(df['col1'] == 1,            # a boolean array/series indicating where the values are equal to 1
                df['col2'].str.split(' '),  # use the output of this if True
                df['col2'].str.split('-')   # else use this
                )
temp_df = pd.DataFrame(temp.tolist()) #create a new dataframe with the columns we need
#Output:
0 1 2
0 aa bb cc
1 ee ff gg
2 hh ii kk
3 ll mm nn
Now just assign the result back to the original df. You can use a concat or join, but a simple assignment suffices as well.
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df
print(df)
col1 col2 col2_0 col2_1 col2_2
0 1 aa bb cc aa bb cc
1 2 ee-ff-gg ee ff gg
2 1 hh ii kk hh ii kk
3 2 ll-mm-nn ll mm nn
EDIT: To address more than two conditional splits
If you need more than two conditions, np.where won't help, since it is only designed for a binary selection. You can opt for a "custom" approach that works with as many splits as you like.
splits = [ ' ', '-', '---']
all_splits = pd.DataFrame({s:df['col2'].str.split(s).values for s in splits})
#Output:
- ---
0 [aa, bb, cc] [aa bb cc] [aa bb cc]
1 [ee-ff-gg] [ee, ff, gg] [ee-ff-gg]
2 [hh, ii, kk] [hh ii kk] [hh ii kk]
3 [ll-mm-nn] [ll, mm, nn] [ll-mm-nn]
First we split df['col2'] on all splits, without expanding. Now it's just a question of selecting the correct list based on the value of df['col1'].
We can use numpy's advanced indexing for this.
temp = all_splits.values[np.arange(len(df)), df['col1']-1]
After this point, the steps are the same as above, starting with creating temp_df.
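For completeness, a minimal sketch of those remaining steps, reusing temp from the line above (the column names are the same illustrative ones used earlier):
temp_df = pd.DataFrame(temp.tolist(), index=df.index)  # expand the selected lists into columns
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df   # assign back to the original frame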

You are pretty close. To generate a column based on some condition, where is often handy; see the code below:
col2_exp1 = df.col2.str.split(' ', expand=True)
col2_exp2 = df.col2.str.split('-', expand=True)
col2_combine = (col2_exp1.where(df.col1.eq(1), col2_exp2)
                .rename(columns=lambda x: f'col2{x}'))
Finally,
df.join(col2_combine)
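For the sample frame above, the join should produce something like:
   col1      col2 col20 col21 col22
0     1  aa bb cc    aa    bb    cc
1     2  ee-ff-gg    ee    ff    gg
2     1  hh ii kk    hh    ii    kk
3     2  ll-mm-nn    ll    mm    nn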

Related

Collapse together pandas row that respect a list of conditions

So, I have a dataframe of the type:
Doc String
A   abc
A   def
A   ghi
B   jkl
B   mnop
B   qrst
B   uv
What I'm trying to do is to merge/collapse rows according to two conditions:
they must be from the same document
they should be merged together up to a max length I have set
So that, for example, with max_len == 6:
Doc String
A   abcdef
A   defghi
B   jkl
B   mnop
B   qrstuv
The output doesn't have to be that strict. To explain the why: I have a document and I was able to split it into sentences; I'd like to have it now in a dataframe with each "new sentence" being of maximal length.
I couldn't find a pure Pandas solution (i.e. do the grouping only by using Pandas methods). You could try the following though:
def group(col, max_len=6):
    groups = []
    group = acc = 0
    for length in col.values:
        acc += length
        if max_len < acc:
            group, acc = group + 1, length
        groups.append(group)
    return groups
groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)
The group function takes a column of string lengths for a Doc group and builds groups that meet the max_len condition. Based on that, another groupby over Doc and groups then aggregates the strings.
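To reproduce this on the sample above, the frame can be built like so (a minimal setup sketch, assuming pandas is imported as pd):
df = pd.DataFrame({"Doc": ["A", "A", "A", "B", "B", "B", "B"],
                   "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"]})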
Result for the sample:
Doc String
0 A abcdef
1 A ghi
2 B jkl
3 B mnop
4 B qrstuv
I have not tried to run this code so there might be bugs, but essentially:
max_length = 6  # not defined in the original snippet; assumed here
uniques = list(set(df['Doc'].values))
new_df = pd.DataFrame(index=uniques, columns=df.columns)
for doc in uniques:
    x_df = df.loc[df['Doc'] == doc, 'String']
    # keep only the first max_length characters of the concatenated strings per Doc
    concatenated = ''.join(x_df.values)[:max_length]
    new_df.loc[doc, 'String'] = concatenated

Formatting specific rows in Dash Datatable with %, $, etc

I am using the Dash Datatable code to create the table in Plotly/Python. I would like to format the various rows in the Value column. For example, I would like to format Row[1] with a $ sign and Row[2] with %. TIA
#Row KPI Value
0 AA 1
1 BB $230.
2 CC 54%
3 DD 5.6.
4 EE $54000
I have been looking into this issue as well. Unfortunately, I didn't succeed with anything built-in either. If you do in the future, please let me know.
However, the solution that I implemented was the following function to easily change the format of DataFrame elements to strings with the formatting I would like:
def dt_formatter(df: pd.DataFrame,
                 formatter: str,
                 slicer: pd.IndexSlice = None) -> pd.DataFrame:
    if slicer is None:
        for col in df.columns:
            df[col] = df[col].apply(formatter.format)
        return df
    else:
        dfs = df.loc[slicer].copy()
        for col in dfs.columns:
            dfs[col] = dfs[col].apply(formatter.format)
        df.loc[slicer] = dfs
        return df
and then use your regular slicing/filtering with your base dataframe df. Assuming your base df looks like this:
>>> df
#Row KPI Value
0 AA 1
1 BB 230
2 CC 54
3 DD 5.6
4 EE 54000
>>> df = dt_formatter(df, '{:.0f}%', pd.IndexSlice[df['#Row'] == 1, ['Value']])
>>> df
#Row KPI Value
0 AA 1
1 BB 230%
2 CC 54
3 DD 5.6
4 EE 54000
Using a different slicer and a different formatting string, you could "build" your DataFrame using such a helper function.
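For example, starting again from the unformatted base df, a dollar format could be applied the same way (the format string and row selection are only illustrative):
>>> df = dt_formatter(df, '${:,.0f}', pd.IndexSlice[df['#Row'].isin([1, 4]), ['Value']])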

How to modify data after replicate in Pandas?

I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since it has duplicates, it will affect the entire rows.
Is there any method to first create duplicates and then modify only the data of the duplicates created?
import pandas as pd
df = pd.read_excel('so.xlsx')
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only on the duplicates created, in this case the bottom two rows. But now I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd
df = pd.read_excel('so.xlsx')
a = df['code'] == 3
b = df[a]
b["region"][2] = "N"
df=df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
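Note that DataFrame.append was removed in pandas 2.0; a sketch of the same idea with pd.concat:
df = pd.concat([df, b], ignore_index=True)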

Group data by columns and row values

I have the following dataframe df:
df =
REGION GROUP_1 GROUP_2 GROUP_3
Reg1 AAA BBB AAA
Reg2 BBB AAA CCC
Reg1 BBB CCC CCC
I need to count the number of unique occurrences of the values of GROUP_1, GROUP_2 and GROUP_3 grouped per REGION (the quantity of GROUP_ columns is 50 in my real dataset).
For the above example, the result should be the following:
result =
REGION COUNT_AAA COUNT_BBB COUNT_CCC
Reg1 1 2 1
Reg2 1 1 1
This is my code:
df = (pd.melt(df, id_vars=['REGION'], value_name='GROUP')
        .drop('variable', axis=1).drop_duplicates()
        .groupby(['REGION', 'GROUP']).agg({'GROUP': 'count'})
        .reset_index())
The problem is that it takes too much time for 1Gb of data. I cannot even check the result on the whole dataset because of very long calculation time. In my opinion, there is something wrong in the code or it can be simplified.
You could start off by dropping duplicated values present among the GROUP_X columns. Then, with the help of lreshape, consolidate these into a single GROUP column.
Perform a groupby with REGION as the key and compute value_counts to get the respective unique counts present in the GROUP column.
Finally, unstack to turn the multi-index series into a dataframe and add an optional prefix to the column headers obtained.
slow approach:
(pd.lreshape(df.apply(lambda x: x.drop_duplicates(), 1),
             {"GROUP": df.filter(like='GROUP').columns})
 .groupby('REGION')['GROUP'].value_counts().unstack().add_prefix('COUNT_'))
To obtain a flat DF:
(pd.lreshape(df.apply(lambda x: x.drop_duplicates(), 1),
             {"GROUP": df.filter(like='GROUP').columns})
 .groupby('REGION')['GROUP'].value_counts()
 .unstack().add_prefix('COUNT_').rename_axis(None, axis=1).reset_index())
slightly fast approach:
With the help of MultiIndex.from_arrays, we could compute unique rows too.
midx = pd.MultiIndex.from_arrays(df.filter(like='GROUP').values, names=df.REGION.values)
d = pd.DataFrame(midx.levels, midx.names)
d.stack().groupby(level=0).value_counts().unstack().rename_axis('REGION')
faster approach:
A faster way would be to create the unique row values using pd.unique (faster than np.unique, as it does not perform a sort operation after finding the unique elements) while iterating through the arrays corresponding to the GROUP_X columns. This takes the major chunk of the time. Then stack, groupby, value_counts and finally unstack it back.
d = pd.DataFrame([pd.unique(i) for i in df.filter(like='GROUP').values], df.REGION)
d.stack().groupby(level=0).value_counts(sort=False).unstack()
set_index
value_counts
notnull converts 1s and 2s to True and np.nan to False
groupby + sum
df.set_index('REGION').apply(
pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int)
AAA BBB CCC
REGION
Reg1 1 2 1
Reg2 1 1 1
Even Faster
val = df.filter(like='GROUP').values
reg = df.REGION.values.repeat(val.shape[1])
idx = df.index.values.repeat(val.shape[1])
grp = val.ravel()
pd.Series({(i, r, g): 1 for i, r, g in zip(idx, reg, grp)}).groupby(level=[1, 2]).sum().unstack()
Faster Still
from collections import Counter
val = df.filter(like='GROUP').values
reg = df.REGION.values.repeat(val.shape[1])
idx = df.index.values.repeat(val.shape[1])
grp = val.ravel()
pd.Series(Counter([(r, g) for _, r, g in pd.unique([(i, r, g) for i, r, g in zip(idx, reg, grp)]).tolist()])).unstack()

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame whose dimension is n x k. In the process I will need to add columns of dimension m x 1, where m is between 1 and n, but I don't know m.
When I try do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different length?
If you use accepted answer, you'll lose your column names, as shown in the accepted answer example, and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To save column names, use pandas.concat, but don't ignore_index (default value of ignore_index is false; so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of all the lists:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all according to the determined max length:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length; create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue, two different dataframes and without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
If somebody would like to replace a specific column with one of a different size instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))   # create an enumerated dict from the list
    dataFrame_asDict = dataframe.to_dict()     # get the DataFrame as a dict
    dataFrame_asDict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataFrame_asDict, orient='index').T  # build a new DataFrame from the dict and return it
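A hypothetical usage, replacing a 2-row column with a 4-element list (the data here is only illustrative):
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df = fill_column(df, [10, 20, 30, 40], 'B')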
