I have a simple-looking problem. I have a dataframe df with two columns. For each of the strings that occur in either of these columns, I would like to count the number of rows that contain that symbol in either column.
E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print elem, len(df.loc[(df[0] == elem) | (df[1] == elem)])
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is, however, very inefficient, and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.
Is there a fast way of doing this?
You can use loc to filter out the rows where 'col2' matches 'col1', append the remaining 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
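If you are on a newer pandas release where Series.append has been removed, pd.concat gives the same counts (a minimal sketch, assuming the columns are named 'col1' and 'col2' as above):
import pandas as pd

# append the 'col2' values that differ from 'col1' in the same row, then count occurrences
counts = pd.concat([df['col1'], df.loc[df['col1'] != df['col2'], 'col2']]).value_counts()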
Timings
Using the following setup to produce a larger sample dataset:
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
OK, this is much trickier than I thought, and I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns of that result to generate a dummies dataframe for all the unique values; we can then take np.maximum of the two dataframes and sum the result:
In [77]:
t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
import io
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0), df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
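Note that reindex_axis has been removed from recent pandas versions; here is a sketch of the same computation using reindex instead (assuming df has the 'col1'/'col2' columns and vals is defined as above):
import numpy as np

# one dummies frame per column, aligned to the full set of unique values
d1 = df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0)
d2 = df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)
# the element-wise maximum marks a row once per symbol, regardless of which column it was in
counts = np.maximum(d1, d2).sum()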
Related
I have a Pandas data frame represented by the one below:
A B C D
| 1 1 1 3 |
| 1 1 1 2 |
| 2 3 4 5 |
I need to iterate through this data frame, looking for rows where the values in columns A, B, and C match; when that is true, check the values in column D for those rows and delete the row with the smaller value. So the above example would look like this afterwards:
A B C D
| 1 1 1 3 |
| 2 3 4 5 |
I've written the following code, but something isn't right and it's causing an error. It also looks more complicated than it may need to be, so I am wondering if there is a better, more concise way to write this.
for col, row in df.iterrows():
    df1 = df.copy()
    df1.drop(col, inplace=True)
    for col1, row1 in df1.iterrows():
        if df[0].iloc[col] == df1[0].iloc[col1] & df[1].iloc[col] == df1[1].iloc[col1] & df[2].iloc[col] == df1[2].iloc[col1] & df1[3].iloc[col1] > df[3].iloc[col]:
            df.drop(col, inplace=True)
Here is one solution:
df[~((df[['A', 'B', 'C']].duplicated(keep=False)) & (df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']))]
Explanation:
df[['A', 'B', 'C']].duplicated(keep=False)
returns a mask for rows with duplicated values of ['A', 'B', 'C'] columns
df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']
returns a mask for rows that have the minimum value for ['D'] column, for each group of ['A', 'B', 'C']
The combination of these masks selects all rows that have duplicated ['A', 'B', 'C'] values and the minimum 'D' within their group. With ~ we select all rows except those.
Result for the provided input:
A B C D
0 1 1 1 3
2 2 3 4 5
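For illustration, here is a minimal sketch of the two masks on the sample frame (the dup_mask/min_mask names are only used for this example):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 3], 'C': [1, 1, 4], 'D': [3, 2, 5]})

dup_mask = df[['A', 'B', 'C']].duplicated(keep=False)                   # True for rows 0 and 1
min_mask = df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']   # True where D is its group's minimum
print(df[~(dup_mask & min_mask)])                                       # drops row 1 (D == 2)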
You can group by all the columns that have to be equal (using groupby(['A', 'B', 'C'])) and then, whenever a group contains more than one unique value, exclude the row with the minimum value of D (using func below) to get the boolean indices for the rows that have to be retained:
def func(x):
    if len(x.unique()) != 1:
        return x != x.min()
    else:
        return x == x

df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: func(x))]
A B C D
0 1 1 1 3
2 2 3 4 5
If only the row with the maximum group value of D has to be retained, then you can use the following:
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: x == x.max())]
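A sketch of the same max-per-group filter written with transform, which produces a boolean mask directly aligned with df's original index:
# keep rows whose D equals the maximum D within their (A, B, C) group
df[df['D'] == df.groupby(['A', 'B', 'C'])['D'].transform('max')]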
I want to know how to group by a single column and join the strings of multiple columns within each group.
Here's an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output.
Desired output:
a b c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by the b column only and then, if necessary, pass the list of columns to aggregate to GroupBy.agg:
df1 = df.groupby('b')['a','c'].agg(','.join).reset_index()
#alternative if want join all columns without b
#df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
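Note that recent pandas versions require a list of column labels when selecting several columns from a groupby, so the first line would become (same output assumed):
df1 = df.groupby('b')[['a', 'c']].agg(','.join).reset_index()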
I was trying to clean up column names in a dataframe, but only for a part of the columns.
Somehow it doesn't work when I try to replace the column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe (copy-able code to reproduce the data is at the bottom):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because a pandas Index is immutable, and df.iloc[:, 1:] returns a new object, so assigning to its columns would not touch df anyway. If you check the documentation for the pandas.Index class, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
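Keep in mind that rename returns a new DataFrame by default, so assign the result back (or pass inplace=True) if you want df itself updated, e.g.:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))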
To overwrite column names you can use the .rename() method.
So, it will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename is in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want:
df.columns = [x if x not in mask else x[:4] for x in df.columns]
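A quick check on the sample frame from the question (assuming mask and df are defined as above):
print(df.columns.tolist())
# ['Value', 'ColA', 'ColB', 'ColC']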
I have a dataset showing which cities each vehicle has been to (as shown in df1 below).
I'm trying to create a list of two-city combinations based on df1, and then for each two-city combination count how many vehicles have been to that particular two-city combination (like df2 below).
I dug around but couldn't find a solution. Does anyone have a solution for this?
(any help will be appreciated)
df1 = pd.DataFrame([
    [1, 'A'], [1, 'B'], [1, 'C'],
    [2, 'A'], [2, 'C'], [2, 'C'], [2, 'A'],
    [3, 'C'], [3, 'B'], [3, 'C'], [3, 'B']], columns=['Vehicle_ID', 'City'])
df2 = pd.DataFrame([['A,B', 1], ['B,C', 2], ['A,C', 2]],
                   columns=['City_Combination', 'Vehicle_Count'])
Note:
(1) The order of cities visited doesn't matter. E.g. under the ('A,B') combination, vehicles that visited (A -> B) or (B -> A) or (A -> C -> B) will all be counted.
(2) The frequency of city visits doesn't matter. E.g. under the ('A,B') combination a vehicle that visited (A -> B -> A -> A) is still counted as 1 vehicle.
Here are two options. The first way is to group by Vehicle_ID and, for each group, generate all the combinations of two cities. Collect the resulting city pairs and Vehicle_ID in a set of tuples (since we don't care about repeated city pairs) and then use the set to generate a new DataFrame. Then group by the city pairs and count the distinct Vehicle_IDs:
import itertools as IT

df1 = df1.drop_duplicates()
data = set()
for vid, grp in df1.groupby(['Vehicle_ID']):
    for c1, c2 in IT.combinations(grp['City'], 2):
        if c1 > c2:
            c1, c2 = c2, c1
        data.add((c1, c2, vid))

df = pd.DataFrame(list(data), columns=['City_x', 'City_y', 'Vehicle_Count'])
# City_x City_y Vehicle_Count
# 0 B C 3
# 1 A C 1
# 2 B C 1
# 3 A C 2
# 4 A B 1
result = df.groupby(['City_x', 'City_y']).count()
yields
Vehicle_Count
City_x City_y
A B 1
C 2
B C 2
An alternative way is to merge df1 with itself:
In [244]: df1 = df1.drop_duplicates()
In [246]: df3 = pd.merge(df1, df1, on='Vehicle_ID', how='left'); df3
Out[246]:
Vehicle_ID City_x City_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 C A
12 2 C C
13 3 C C
14 3 C B
15 3 B C
16 3 B B
Unfortunately for us, pd.merge generates the full product of city pairs within each vehicle, so we need to remove the rows where City_x >= City_y:
In [247]: mask = df3['City_x'] < df3['City_y']
In [248]: df3 = df3.loc[mask]; df3
Out[249]:
Vehicle_ID City_x City_y
1 1 A B
2 1 A C
5 1 B C
10 2 A C
15 3 B C
And now we can once again groupby City_x, City_y and count the result:
In [251]: result = df3.groupby(['City_x', 'City_y']).count(); result
Out[251]:
Vehicle_ID
City_x City_y
A B 1
C 2
B C 2
import numpy as np
import pandas as pd
import itertools as IT

def using_iteration(df1):
    df1 = df1.drop_duplicates()
    data = set()
    for vid, grp in df1.groupby(['Vehicle_ID']):
        for c1, c2 in IT.combinations(grp['City'], 2):
            if c1 > c2:
                c1, c2 = c2, c1
            data.add((c1, c2, vid))
    df = pd.DataFrame(list(data), columns=['City_x', 'City_y', 'Vehicle_Count'])
    result = df.groupby(['City_x', 'City_y']).count()
    return result

def using_merge(df1):
    df1 = df1.drop_duplicates()
    df3 = pd.merge(df1, df1, on='Vehicle_ID', how='left')
    mask = df3['City_x'] < df3['City_y']
    df3 = df3.loc[mask]
    result = df3.groupby(['City_x', 'City_y']).count()
    result = result.rename(columns={'Vehicle_ID': 'Vehicle_Count'})
    return result

def generate_df(nrows, nids, strlen):
    cities = (np.random.choice(list('ABCD'), nrows*strlen)
              .view('|S{}'.format(strlen)))
    ids = np.random.randint(nids, size=(nrows,))
    return pd.DataFrame({'Vehicle_ID': ids, 'City': cities})

df1 = pd.DataFrame([
    [1, 'A'], [1, 'B'], [1, 'C'],
    [2, 'A'], [2, 'C'], [2, 'C'], [2, 'A'],
    [3, 'C'], [3, 'B'], [3, 'C'], [3, 'B']], columns=['Vehicle_ID', 'City'])

df = generate_df(10000, 50, 2)
assert using_merge(df).equals(using_iteration(df))
If df1 is small, using_iteration may be faster than using_merge. For example,
with the df1 from the original post,
In [261]: %timeit using_iteration(df1)
100 loops, best of 3: 3.45 ms per loop
In [262]: %timeit using_merge(df1)
100 loops, best of 3: 4.39 ms per loop
However, if we generate a DataFrame with 10000 rows, 50 Vehicle_IDs and 16 distinct cities,
then using_merge may be faster than using_iteration:
df = generate_df(10000, 50, 2)
In [241]: %timeit using_merge(df)
100 loops, best of 3: 7.73 ms per loop
In [242]: %timeit using_iteration(df)
100 loops, best of 3: 16.3 ms per loop
Generally speaking, the more iterations required by the for-loops in
using_iteration -- i.e. the more Vehicle_IDs and possible city pairs -- the
more likely NumPy- or Pandas-based methods (such as pd.merge) will be faster.
Note, however, that pd.merge generates a bigger DataFrame than we ultimately need, so using_merge may require more memory than using_iteration. At some point, for sufficiently big df1s, using_merge may need swap space, which can make it slower than using_iteration.
So it is best to test using_iteration and using_merge (and other solutions) on your actual data to see what is fastest.
First let's pivot the table so that the cities are columns, and there's one row per vehicle:
In [50]: df1['n'] = 1
In [51]: df = df1.pivot_table(index='Vehicle_ID', columns = 'City', values = 'n', aggfunc=sum)
df
Out[51]:
City A B C
Vehicle_ID
1 1 1 1
2 2 NaN 2
3 NaN 2 2
Now we can get the combinations with itertools.combinations (note we have to coerce to list to view all the values at once, since itertools by default returns an iterator):
from itertools import combinations
city_combos = list(combinations(df1.City.unique(), 2))
city_combos
Out[19]: [('A', 'B'), ('A', 'C'), ('B', 'C')]
Finally we can iterate through the combos and compute the counts:
In [87]: pd.Series({c:df[list(c)].notnull().all(axis=1).sum() for c in city_combos})
Out[87]:
A B 1
C 2
B C 2
dtype: int64
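If you also need the df2 layout from the question, here is a sketch that reshapes the resulting Series (it assumes the Series is assigned to a name such as counts; 'City_1' and 'City_2' are just placeholder level names):
counts = pd.Series({c: df[list(c)].notnull().all(axis=1).sum() for c in city_combos})

# turn the (city, city) MultiIndex into columns and build the combined label
df2 = counts.rename('Vehicle_Count').rename_axis(['City_1', 'City_2']).reset_index()
df2['City_Combination'] = df2['City_1'] + ',' + df2['City_2']
df2 = df2[['City_Combination', 'Vehicle_Count']]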
I am trying to classify my data into percentile buckets based on its values. My data looks like this:
import numpy as np
import pandas as pnd

a = pnd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], columns=['data'])
a.data = np.random.randn(10)
print a
print '\nthese are ranked as shown'
print a.rank()
data
a -0.310188
b -0.191582
c 0.860467
d -0.458017
e 0.858653
f -1.640166
g -1.969908
h 0.649781
i 0.218000
j 1.887577
these are ranked as shown
data
a 4
b 5
c 9
d 3
e 8
f 2
g 1
h 7
i 6
j 10
To rank this data, I am using the rank function. However, I am interested in creating a bucket of the top 20%. In the example shown above, this would be a list containing the labels ['c', 'j'].
desired result : ['c','j']
How do I get the desired result?
In [13]: a[a > a.quantile(0.8)].dropna()
Out[13]:
       data
c  0.860467
j  1.887577
In [14]: list(a[a > a.quantile(0.8)].dropna().index)
Out[14]: ['c', 'j']
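Since you are already using rank, a rank-based sketch also works: percentile ranks above 0.8 correspond to the top 20% (top20 is just an illustrative name):
top20 = a[a['data'].rank(pct=True) > 0.8].index.tolist()
# ['c', 'j'] for the sample data above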