I have a pandas dataframe as below. I want to create lists of columns by iterating over a list called 'fields_list', separating out the columns whose names end with each suffix in 'fields_list'.
import pandas as pd
import numpy as np
import sys
df = pd.DataFrame({'a_balance': [3,4,5,6], 'b_balance': [5,1,1,1]})
df['ah_balance'] = 0
df['a_agg_balance'] = 0
df['b_agg_balance'] = 0
df
   a_balance  b_balance  ah_balance  a_agg_balance  b_agg_balance
0          3          5           0              0              0
1          4          1           0              0              0
2          5          1           0              0              0
3          6          1           0              0              0
fields_list = [ ['<val>','_balance'],['<val_class>','_agg_balance']]
fields_list
[['<val>', '_balance'], ['<val_class>', '_agg_balance']]
for i, field in fields_list:
    df_final = [col for col in df if col.endswith(field)]
    print("df_final", df_final)
I tried the above code, but when it iterates over the 1st element of fields_list (i.e. ['<val>', '_balance']) it also includes columns that end with '_agg_balance', and hence I get the result below:
df_final ['a_balance', 'b_balance', 'ah_balance', 'a_agg_balance', 'b_agg_balance']
df_final ['a_agg_balance', 'b_agg_balance']
My expected output is
df_final ['a_balance', 'b_balance', 'ah_balance']
df_final ['a_agg_balance', 'b_agg_balance']
You can sort the suffixes you're looking at, and start with the longest one. When you find a column that matches a suffix, remove it from the set of columns you need to look at:
fields_list = [['<val>', '_balance'], ['<val_class>', '_agg_balance']]
sorted_list = sorted(fields_list, key=lambda x: len(x[1]), reverse=True)
sorted_suffixes = [x[1] for x in sorted_list]
col_list = set(df.columns)

for suffix in sorted_suffixes:
    forecast_final_fields = [col for col in col_list if col.endswith(suffix)]
    col_list.difference_update(forecast_final_fields)
    print("forecast_final_fields", forecast_final_fields)
Results in
forecast_final_fields ['a_agg_balance', 'b_agg_balance']
forecast_final_fields ['ah_balance', 'a_balance', 'b_balance']
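If you also want to keep each group paired with its label from fields_list, a small variation of the same longest-suffix-first idea might look like this (a sketch, reusing df and fields_list from above):

col_list = set(df.columns)
groups = {}
# Longest suffix first, so '_agg_balance' claims its columns before '_balance'
for label, suffix in sorted(fields_list, key=lambda x: len(x[1]), reverse=True):
    matched = [col for col in col_list if col.endswith(suffix)]
    col_list.difference_update(matched)
    groups[label] = matched
print(groups)
# e.g. {'<val_class>': ['a_agg_balance', 'b_agg_balance'],
#       '<val>': ['ah_balance', 'a_balance', 'b_balance']}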
I'm hoping to count specific values from a pandas df. Using the code below, I subset Item by 'Up' and group by Num and Label to count the values in Item. The values in the output are correct, but I want to drop Label and include Up in the column headers.
import pandas as pd
df = pd.DataFrame({
    'Num': [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    'Label': ['A','B','A','B','B','B','A','B','B','A','A','A','B','A','B','A'],
    'Item': ['Up','Left','Up','Left','Down','Right','Up','Down','Right','Down','Right','Up','Up','Right','Down','Left'],
})

df1 = (df[df['Item'] == 'Up']
       .groupby(['Num','Label'])['Item']
       .count()
       .unstack(fill_value=0)
       .reset_index()
       )
intended output:
Num  A_Up  B_Up
  1     3     0
  2     1     1
With your approach, you can include the Item in the grouper.
out = (df[df['Item'] == 'Up'].groupby(['Num','Label','Item']).size()
         .unstack(['Label','Item'], fill_value=0))
out.columns = out.columns.map('_'.join)
print(out)
     A_Up  B_Up
Num
1       3     0
2       1     1
You can use GroupBy.transform to get the counts while keeping all the columns. Then use df.pivot_table and a list comprehension to get your desired column names.
In [2301]: x = df[df['Item'] == 'Up'].copy()  # copy to avoid SettingWithCopyWarning

In [2304]: x['c'] = x.groupby(['Num','Label'])['Item'].transform('count')

In [2310]: x = x.pivot_table(index='Num', columns=['Label', 'Item'], aggfunc='first', fill_value=0)

In [2313]: x.columns = [j + '_' + k for i, j, k in x.columns]

In [2314]: x
Out[2314]:
     A_Up  B_Up
Num
1       3     0
2       1     1
For example, I have a list of 100 data frames; some have a column length of 8, others 10, others 12. I want to be able to split these into groups based on their column length. I have tried dictionaries but couldn't get the values to append properly in a loop.
Previously tried code:
col_count = [8, 10, 12]
d = dict.fromkeys(col_count, [])

for df in df_lst:
    for i in col_count:
        if i == len(df.columns):
            d[i] = df
but this just seems to replace the values in the dict each time. I have tried .append also, but that seems to append to all keys.
Instead of assigning a df to d[column_count], you should append it.
You initialized d with d = dict.fromkeys(col_count, []), so d looks like a dictionary of empty lists. Beware, though: dict.fromkeys reuses the same list object for every key, which is exactly why your .append attempt seemed to append to all keys. Initialize with a dict comprehension such as d = {k: [] for k in col_count} instead.
When you do d[i] = df you replace the list with a DataFrame, so d becomes a dictionary of DataFrames. If you do d[i].append(df) (with per-key lists) you will have a dictionary of lists of DataFrames, which is what you want AFAIU.
Also, I'm not sure that you need the col_count variable. You could just do d[len(df.columns)].append(df).
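A minimal sketch of that grouping, assuming df_lst is your existing list of dataframes (toy stand-ins below); collections.defaultdict sidesteps the shared-list pitfall of dict.fromkeys entirely:

import pandas as pd
from collections import defaultdict

# Toy stand-ins for your 100 dataframes of 8/10/12 columns
df_lst = [pd.DataFrame(columns=range(n)) for n in (8, 10, 8, 12)]

d = defaultdict(list)  # every new key gets its own fresh list
for df in df_lst:
    d[len(df.columns)].append(df)

print({k: len(v) for k, v in d.items()})  # {8: 2, 10: 1, 12: 1}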
I think this should suffice for you. Think of how to dynamically solve your problems to make better use of Python.
In [2]: import pandas as pd
In [3]: for i in range(1, 6):
   ...:     exec(f"df{i} = pd.DataFrame(0, index=range({i}), columns=list('ABCD'))")  # making my own testing list of dataframes df1..df5 with variable length
   ...:
In [4]: df1  # one-row df
Out[4]:
   A  B  C  D
0  0  0  0  0

In [5]: df2  # two-row df
Out[5]:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0

In [6]: df3  # three-row df
Out[6]:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
In [7]: L = [df1, df2, df3, df4, df5]  # I assume all your dataframes are put into a container like this, which is the problem

In [13]: my_3_length_shape_dfs = []  # create a container for each length you care about (or use an additional exec as above)

In [14]: for i in L:
    ...:     if i.shape[0] == 3:  # add more of these if needed; you mentioned your lengths are known [8, 10, 12] -- for column counts, test i.shape[1] instead
    ...:         my_3_length_shape_dfs.append(i)  # add the df to the container, grouping all dfs of row length/shape 3
    ...:         print(i)
    ...:

   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
I have a dataframe with column a. I need to get the data after the second _.
                      a
0    abc_def12_0520_123
1  def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns=['a'])
Output:
                      a         b
0    abc_def12_0520_123  0520_123
1  def_ghij123_0120_456  0120_456
What I have tried:
df['b'] = df['a'].str.replace(r'\D+', '', regex=True)
I tried removing the alphabetic characters first, but it's getting complex. Any suggestions?
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
                      a         b
0    abc_def12_0520_123  0520_123
1  def_ghij123_0120_456  0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
    a = s.split('_')[2:]  # all the pieces after splitting on '_', skipping the first two
    lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_', n=2).str[-1]

                      a         b
0    abc_def12_0520_123  0520_123
1  def_ghij123_0120_456  0120_456
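If you'd rather finish the regex route you started, a sketch using str.extract instead of str.replace (assuming the goal is always everything after the second underscore):

# Capture everything after the second '_'; expand=False returns a Series
df['b'] = df['a'].str.extract(r'^[^_]+_[^_]+_(.*)$', expand=False)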
I have a list ['df1', 'df2'] where I have stored some dataframes which have been filtered on a few conditions. Then I converted this list to a dataframe using
df = pd.DataFrame(list1)
now the df has only one column
     0
0  df1
1  df2
sometimes it may also have
     0
0  df1
1  df2
2  df3
I wanted to concat all of these. My static code is
df_new = pd.concat([df1,df2],axis=1) or
df_new = pd.concat([df1,df2,df3],axis=1)
How can I make it dynamic (without me specifying df1, df2, ...) so that it takes the values and concats them?
Using a list to collect the dataframes before concatenating:
import pandas as pd

lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)

df = pd.concat(arr, axis=1)
df
Result:
   0  0
0  1  4
1  2  5
2  3  6
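Applied to your case, a minimal sketch would be to collect the filtered frames themselves in a list as you create them, instead of storing their names as strings; the toy frames below stand in for your df1, df2, ...:

import pandas as pd

filtered = []  # collect each filtered frame here as you produce it
for frame in (pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'y': [3, 4]})):  # toy stand-ins
    filtered.append(frame)

df_new = pd.concat(filtered, axis=1)  # works for 2, 3, or any number of frames
print(df_new)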
Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input is 3 data frames df1, df2 and df3:
df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})
   a  b  e
0  1  3  0
1  5  9  7
df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})
   d  e  f
0  2  0  2
1  3  7  1
df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})
   b  c  e
0  3  8  0
1  9  2  7
The output is the list ['b', 'e'].
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can combine this with map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
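For the example frames above, this should print the same two names (insertion order may differ from the pandas-based variants):

print(res)
# ['b', 'e']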
Here is my code for this problem, for comparing only two data frames, without concatenating them.
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1': df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select the column at the xth index
        col = df1.iloc[:, x]
        # Collect the names of all df2 columns with identical values
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select the column at the yth index
            otherCol = df2.iloc[:, y]
            # Check if the two columns are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"] == df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare
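A quick usage sketch with the example frames from this question, comparing df1 and df3 (assuming pandas is imported), which should print something like the frame shown in the comments:

print(getDuplicateColumns(df1, df3))
#   df1    df2
# 0   a     []
# 1   b  ['b']
# 2   e  ['e']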