I have a Pandas dataframe something like:
Feature A  Feature B  Feature C
A1         B1         C1
A2         B2         C2
Given k as input, I want all combinations of values of length k, grouped by feature. For example, for k = 2 I want:
[{A:A1, B:B1},
{A:A1, B:B2},
{A:A1, C:C1},
{A:A1, C:C2},
{A:A2, B:B1},
{A:A2, B:B2},
{A:A2, C:C1},
{A:A2, C:C2},
{B:B1, C:C1},
{B:B1, C:C2},
{B:B2, C:C1},
{B:B2, C:C2}]
How can I achieve that?
This is probably not that efficient, but it works at a small scale.
First, determine the unique combinations of k columns.
from itertools import combinations
k = 2
cols = list(combinations(df.columns, k))
Then use MultiIndex.from_product to get the Cartesian product of each combination of k columns.
result = []
for c in cols:
    result += pd.MultiIndex.from_product([df[x] for x in c]).values.tolist()
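If you need the result as dictionaries keyed by column name, as in the question, here is a sketch built on the same cols (the column names themselves, e.g. 'Feature A', become the keys):
result = []
for c in cols:
    # zip each product tuple with the column names it came from
    for values in pd.MultiIndex.from_product([df[x] for x in c]):
        result.append(dict(zip(c, values)))
# e.g. [{'Feature A': 'A1', 'Feature B': 'B1'}, {'Feature A': 'A1', 'Feature B': 'B2'}, ...]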
I have a function that takes a pandas dataframe with index labels of the form <int>_<int> (which denote size ranges in µm), and columns that hold values for separate samples in those size ranges.
The size ranges are consecutive as in the following example:
df = pd.DataFrame({'A': ['a', 'd', 'g', 'j'], 'B': ['b', 'e', 'h', 'k'], 'C': ['c', 'f', 'i', 'l']}, index = ['0_10', '10_20', '20_30', '30_40'])
A B C
0_10 a b c
10_20 d e f
20_30 g h i
30_40 j k l
Note: for demonstration purposes the values are letters here. The real values are float64 numbers.
Here is the code that I am using so far. The docstring shows what it is doing. As such it works fine; however, the nested loop and the iterative creation of new rows make it very slow. For a dataframe with 200 rows and 21 columns it runs for about 2 min.
def combination_sums(df):  # TODO: speed up
    """
    Append new rows to a DF, where each new row is a column-wise sum of an original row
    and any possible combination of consecutively following rows. The input DF must have
    an index according to the scheme below.

    Example:
        INPUT DF                      OUTPUT DF
              A  B  C                        A        B        C
        0_10  a  b  c                 0_10   a        b        c
        10_20 d  e  f      -->        10_20  d        e        f
        20_30 g  h  i                 20_30  g        h        i
        30_40 j  k  l                 30_40  j        k        l
                                      0_20   a+d      b+e      c+f
                                      0_30   a+d+g    b+e+h    c+f+i
                                      0_40   a+d+g+j  b+e+h+k  c+f+i+l
                                      10_30  d+g      e+h      f+i
                                      10_40  d+g+j    e+h+k    f+i+l
                                      20_40  g+j      h+k      i+l
    """
    ol = len(df)  # original length
    for i in range(ol):
        for j in range(i + 1, ol):
            # string for the new row index, from the first and the last rows in the sum
            new_row_name = df.index[i].split('_')[0] + '_' + df.index[j].split('_')[1]
            # sum rows i..j inclusive, so e.g. 0_20 = row 0_10 + row 10_20
            df.loc[new_row_name] = df.iloc[i:j + 1].sum()
    return df
I am wondering what could be a better way to make it more efficient, e.g. using an intermediate conversion to a NumPy array and doing it as a vectorised operation. From somewhat similar posts (e.g. here), I thought there could be a way with numpy mgrid or ogrid; however, they were not similar enough for me to adapt to what I want to achieve.
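One possible direction (a sketch, not from the original post, assuming the same <int>_<int> index scheme and numeric values): precompute a cumulative sum over the rows so that every consecutive-range sum becomes a single subtraction, and build all new rows at once with concat instead of growing the dataframe row by row with .loc:
import numpy as np
import pandas as pd

def combination_sums_fast(df):
    # cumulative sums with a leading row of zeros:
    # the sum of rows i..j (inclusive) is csum[j + 1] - csum[i]
    csum = np.vstack([np.zeros((1, df.shape[1])), df.to_numpy().cumsum(axis=0)])
    starts = [lbl.split('_')[0] for lbl in df.index]
    ends = [lbl.split('_')[1] for lbl in df.index]
    n = len(df)
    rows, names = [], []
    for i in range(n):
        for j in range(i + 1, n):
            rows.append(csum[j + 1] - csum[i])
            names.append(f'{starts[i]}_{ends[j]}')
    extra = pd.DataFrame(rows, index=names, columns=df.columns)
    return pd.concat([df, extra])
The Python loops remain, but they only collect small arrays and labels; the expensive part, repeatedly appending rows to the dataframe inside the loop, is gone.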
Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and the 2 lists, I want to count the number of elements in new_list that appear as a row of the dataframe (i.e. col1-col2 of the same row). In the above pseudo example, the result would be 3, as "aaa-fff", "ccc-ggg", and "ddd-ccc" are each in a row of the dataframe.
Right now, I am using a linear search algorithm but it is very slow as I have to scan through the entire dataframe.
df['col3'] = df['col1'] + "-" + df['col2']
c1 = 0
for a in list_a:
    for b in list_b:
        str1 = a + "-" + b
        str2 = b + "-" + a
        c2 = (df['col3'].str.contains(str1).sum()) + (df['col3'].str.contains(str2).sum())
        c1 += c2
return c1  # (this snippet runs inside a function)
Can someone kindly help me implement a faster algorithm preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe and create the 2 lists dynamically, and get an aggregate count for each row.
Here is another way. First, I used your definition of df (with 2 columns), list_a and list_b.
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} |
{ f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
UPDATE to handle duplicate values.
# build list (not set) from list_a and list_b
idx = ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
[ f'{b}-{a}' for a, b in zip(list_a, list_b) ])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
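If only the aggregate count is needed rather than the per-pair breakdown, the filtered value counts can simply be summed, e.g.:
total = tmp[ tmp.index.isin(idx) ].sum()
print(total)  # 3 for the example data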
Try this:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of the intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
yields
3
First join the columns before looping; then, instead of looping, pass a single regex containing all possible strings to contains.
joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
[f'({b}-{a})' for a in list_a for b in list_b]) # substitute for itertools.product
ct = joined.str.contains(pat).sum()
To work with dicts instead of dataframes, you can use filter() with a compiled regex on the joined strings, as in this question
import re
import itertools
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
### build the regex pattern
pat_set = set('-'.join(combo) for combo in set(
list(itertools.product(list_a, list_b)) +
list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)
# use itertools to generalize to many columns, remove duplicates with set()
### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]
### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
Third option with series.isin() inspired by jsmart's answer
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
Speed testing
I repeated the data 100,000 times for scalability testing. series.isin() takes the day, while jsmart's answer is fast but does not find all occurrences because it removes duplicates from joined
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
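For reference, here is a minimal sketch of how such a comparison can be set up (my reconstruction, not the original benchmark; it assumes the df from the question and the pat_set built above):
import time
import pandas as pd

big = pd.concat([df] * 100_000, ignore_index=True)  # repeat the 6 example rows
joined = big.col1 + '-' + big.col2

t0 = time.perf_counter()
ct_isin = joined.isin(pat_set).sum()  # counts per row, so duplicates are included
print('series.isin():', ct_isin, f'{time.perf_counter() - t0:.2f} s')

t0 = time.perf_counter()
ct_regex = joined.str.contains('|'.join(pat_set)).sum()
print('str.contains():', ct_regex, f'{time.perf_counter() - t0:.2f} s')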
I don't understand this line of code
minimum.append(min(j[1]['Data_Value']))
...specifically
j[1]['Data_Value']
I know the full code returns the minimum value and stores it in a list called minimum, but what does the j[1] do there? I've tried using other numbers to figure it out but get an error. Is it selecting the index or something?
Full code below. Thanks!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
df1 = pd.read_csv('./data/C2A2_data/BinnedCsvs_d400/ed157460d30113a689e487b88dcbef1f5d64cbd8bb7825f5f485013d.csv')
minimum = []
maximum = []
month = []
df1 = df1[~(df1['Date'].str.endswith(r'02-29'))]
times1 = pd.DatetimeIndex(df1['Date'])
df = df1[times1.year != 2015]
times = pd.DatetimeIndex(df['Date'])
for j in df.groupby([times.month, times.day]):
    minimum.append(min(j[1]['Data_Value']))
    maximum.append(max(j[1]['Data_Value']))
Explanation
Iterating over a pandas GroupBy object yields tuples of (key, dataframe), where the key is the value of the groupby key for that group. See below for an example.
Looping over these j's means looping over these tuples.
j[0] refers to the group key.
j[1] is the dataframe component of that tuple, and ['Data_Value'] selects a column of that dataframe.
Example
df = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 4, 6]})
df_grouped = df.groupby('a')
for j in df_grouped:
    print(f"Groupby key (col a): {j[0]}")
    print("dataframe:")
    print(j[1])
Yields:
Groupby key (col a): 1
dataframe:
a b
0 1 2
1 1 4
Groupby key (col a): 2
dataframe:
a b
2 2 6
More readable solution
Another, more comfortable, way to get the min/max of Data_Value for every month-day combination is this:
data_value_summary = df \
    .groupby([times.month, times.day]) \
    .agg({'Data_Value': [min, max]}) \
    ['Data_Value']  # selecting 'Data_Value' drops the extra column level created by agg
minimum = data_value_summary['min']
maximum = data_value_summary['max']
I've got a defaultdict with a nested dictionary from which I'm trying to get the sum of the values. But I've been struggling to find a way to do this.
In the example below, I'm trying to sum all the count values:
from collections import defaultdict
x = defaultdict(dict)
x['test1']['count'] = 14
x['test4']['count'] = 14
x['test2']['count'] = 14
x['test3']['count'] = 14
print x
""" methods I've tried """
# print x.values()
# print sum(x for y in x.values() for x in y['count'].iteritems())
# print sum(x.itervalues())
The methods above that I've tried (in many different variations) didn't provide the desired results.
Any clues or assistance as to where I may be in error?
If you only need to calculate the sum of the 'count' keys, you may do:
>>> sum(y['count'] for y in x.values())
56
If there is a possibility of having other keys as well (apart from 'count'), and you want to calculate the sum of all the values, then you have to do:
>>> sum(z for y in x.values() for z in y.values())
56
# OR,
# import itertools
# sum(itertools.chain(*[y.values() for y in x.values()]))
Just sum(x[k]['count'] for k in x) should work.
If you want to sum the values of all sub dictionaries, sum twice:
>>> sum(sum(y.values()) for y in x.values())
56
I have a list of bi-grams like this:
[['a', 'b'], ['e', 'f']]
Now I want to add these bigrams to a DataFrame with their frequencies like this:
   b  f
a  1  0
e  0  1
I tried doing this with the following code, but it raises an error because the index doesn't exist yet. Is there a fast way to do this for really big data (like 200,000 bigrams)?
matrixA = pd.DataFrame()
# Put the counts in a matrix
for elem in grams:
    tag1, tag2 = elem[0], elem[1]
    matrixA.loc[tag1, tag2] += 1
from collections import Counter
bigrams = [[['a','b'],['e', 'f']], [['a','b'],['e', 'g']]]
pairs = []
for bg in bigrams:
    pairs.append((bg[0][0], bg[0][1]))
    pairs.append((bg[1][0], bg[1][1]))
c = Counter(pairs)
>>> pd.Series(c).unstack() # optional: .fillna(0)
b f g
a 2 NaN NaN
e NaN 1 1
The above is for the intuition. This can be wrapped up in a one line generator expression as follows:
pd.Series(Counter((bg[i][0], bg[i][1]) for bg in bigrams for i in range(2))).unstack()
You can use Counter from the collections package. Note that I changed the contents of the list to be tuples rather than lists. This is because Counter keys (like dict keys) must be hashable.
from collections import Counter
l = [('a','b'),('e', 'f')]
index, cols = zip(*l)
df = pd.DataFrame(0, index=index, columns=cols)
c = Counter(l)
for (row, col), count in c.items():
    df.loc[row, col] = count
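For the example list above, this fills in the two observed pairs and leaves the rest at zero:
>>> df
   b  f
a  1  0
e  0  1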