"Zipping" two dataframes by column values
Suppose I have a dataframe like:
df1 = pd.DataFrame({"A": range(6), "key": [0,1]*3})
df1
A key
0 0 0
1 1 1
2 2 0
3 3 1
4 4 0
5 5 1
and
df2 = pd.DataFrame({"C": ["k0-"+str(x) for x in range(3)] + ["k1-"+str(x) for x in range(3)] , "key": [0]*3 + [1]*3})
df2
C key
0 k0-0 0
1 k0-1 0
2 k0-2 0
3 k1-0 1
4 k1-1 1
5 k1-2 1
Values in C are all unique and values in key have no such pattern in a real dataset.
I'm trying to merge the two with a resulting dataframe, where values in column C will be taken exactly once for a matching value in column key.
I.e.
A key C
0 0 0 k0-0
1 1 1 k1-0
2 2 0 k0-1
3 3 1 k1-1
4 4 0 k0-2
5 5 1 k1-2
The order doesn't matter, i.e. values in C do not need to be taken sequentially. This is a toy example, I have ~10 keys in reality.
I know I can probably do an outer join and then somehow drop the non-unique C values. But that could be overkill, since the real datasets have many rows (~30k).
Thanks in advance!

You can add an extra column to be used in the join:
df1['order'] = df1.groupby('key').cumcount()
df2['order'] = df2.groupby('key').cumcount()
# If you want to match on random order:
# df2['order'] = df2.sample(frac=1).groupby('key').cumcount()
df1.merge(df2, on=['key', 'order'])
Result:
A key order C
0 0 0 0 k0-0
1 1 1 0 k1-0
2 2 0 1 k0-1
3 3 1 1 k1-1
4 4 0 2 k0-2
5 5 1 2 k1-2
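A self-contained sketch of the same idea is below. It assumes that df2 might hold fewer rows than df1 for some key (this is not stated in the question), so it uses a left merge to keep every row of df1, and it drops the helper column afterwards.

```python
import pandas as pd

df1 = pd.DataFrame({"A": range(6), "key": [0, 1] * 3})
df2 = pd.DataFrame({
    "C": [f"k0-{x}" for x in range(3)] + [f"k1-{x}" for x in range(3)],
    "key": [0] * 3 + [1] * 3,
})

# Number the rows within each key: 0, 1, 2, ...
left = df1.assign(order=df1.groupby("key").cumcount())
right = df2.assign(order=df2.groupby("key").cumcount())

# how="left" keeps all rows of df1; rows without a partner would get NaN in C
result = (
    left.merge(right, on=["key", "order"], how="left")
        .drop(columns="order")
)
print(result)
```

Using assign instead of in-place column creation leaves df1 and df2 unchanged.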

You can build a dictionary of iterators and call next on the appropriate iterator depending on the 'key'.
g = {k: iter(v) for k, v in df2.groupby('key').C}
df1.assign(C=[next(g[x]) for x in df1.key])
A key C
0 0 0 k0-0
1 1 1 k1-0
2 2 0 k0-1
3 3 1 k1-1
4 4 0 k0-2
5 5 1 k1-2
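Note that next raises StopIteration once the iterator for a key runs out. A hedged variant, assuming a missing value is acceptable in that case, passes a default to next. The shortened df2 here is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": range(6), "key": [0, 1] * 3})
# Hypothetical df2 with only two rows for key 1
df2 = pd.DataFrame({
    "C": ["k0-0", "k0-1", "k0-2", "k1-0", "k1-1"],
    "key": [0, 0, 0, 1, 1],
})

iters = {k: iter(v) for k, v in df2.groupby("key")["C"]}

# next(it, None) returns None instead of raising when an iterator is exhausted;
# iters.get(k, iter(())) also covers keys that do not occur in df2 at all
out = df1.assign(C=[next(iters.get(k, iter(())), None) for k in df1["key"]])
print(out)
```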

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on Quarters (ex:Q1,Q2,Q3,Q4,Q5..Q100..Q1000) for each column pattern
b) By column pattern, I mean the keyword that is before underscore which is rev and tx.
So, I tried the below but it doesn't work and it also shifts the ID column to the back
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be like as below. In real time, there are more than 100 columns with more than 30 patterns like rev, tx etc. I want my ID column to be in the first position as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1

For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or using manual sorting with np.lexsort:
idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
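If installing natsort is not an option, a standard-library sketch along the same lines is below. It assumes column names look like prefix_Q followed by a number and that any other column (such as ID) should come first. The helper name col_key and the sample columns are made up for illustration.

```python
import re
import pandas as pd

cols = ["ID", "rev_Q1", "rev_Q10", "rev_Q2", "tx_Q3", "tx_Q1"]
df = pd.DataFrame([[1] * len(cols)], columns=cols)

def col_key(name):
    """Sort key: non-matching names first, then by prefix, then by quarter number."""
    m = re.fullmatch(r"(.+)_Q(\d+)", name)
    if m is None:
        return (0, name, 0)  # columns such as ID sort ahead of everything else
    return (1, m.group(1), int(m.group(2)))

out = df[sorted(df.columns, key=col_key)]
print(list(out.columns))
```

Because the quarter is compared as an integer, rev_Q10 lands after rev_Q2 rather than before it.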

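Something like:
new_order = list(df.columns)
new_order.remove("ID")  # list.remove works in place and returns None
new_order = ["ID"] + sorted(new_order)
df = df[new_order]
This puts "ID" in front manually and then sorts the remaining columns. Note that sorted compares the names as strings, so Q10 would come before Q2.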

The idea is to create a dataframe from the column names. Create two columns: one for Variable and another one for Quarter number. Finally sort this dataframe by values then extract index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5

Get maximum occurrence of one specific value per row with pandas

I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
I want to get, for each row, the length of the longest sequence of consecutive 0 values.
So the expected result for this dataframe is an array that looks like this:
[5,4,2,...]
since in the first row the longest sequence of 0 values has length 5, etc.
I have seen this post and tried, to begin with, to get this for the first row only (though I would like to do it for the whole dataframe at once), but I got an error:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
When I entered the values manually like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0 values in the row, not the longest sequence.
However, I don't understand why it raises the error in the first place and, more importantly, I would like to run it in the end on the whole dataframe, per row.
My end goal: get the maximum uninterrupted occurrence of value 0 in a row.

A vectorized solution: c holds a running count of consecutive 0 values along each row, so the row-wise maximum of c gives the longest run:
#more explain https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)
b = m.cumsum(axis=1)
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)
print (c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2
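As for the single-row attempt in the question, a minimal sketch that makes it work is below. It assumes the intent was to use a boolean mask of the zeros instead of the raw values. The name run_id is chosen for illustration.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])

m = s.eq(0)             # True where the value is 0
run_id = (~m).cumsum()  # increases at every non-zero, so each run of zeros shares one id
longest = run_id[m].value_counts().max()
print(longest)
```

On an integer Series, ~ is a bitwise NOT (0 becomes -1), and indexing with [s] looks up the labels 0 and 1 instead of filtering, which is why the manual attempt returned 7. The TypeError probably means the row had a dtype that does not support ~, such as float or object, though that is a guess since the dtypes of df_day are not shown.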

Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
OUTPUT
0 5
1 4
2 2

The following code should do the job.
The function longest_streak counts the lengths of the runs of consecutive zeros and returns the longest one; you can use it with apply on your df.
from itertools import groupby
def longest_streak(row):
    # groupby yields one (value, run) pair per run of equal consecutive values
    zero_runs = [sum(1 for _ in run) for value, run in groupby(row) if value == 0]
    # default=0 covers rows that contain no zeros at all
    return max(zero_runs, default=0)
df.apply(longest_streak, axis=1)

Grouped by set of columns, first non zero value and one of all zeros in a column needs to be flagged as 1 and rest as 0

import pandas as pd
df = pd.DataFrame({'Org1': [1,1,1,1,2,2,2,2,3,3,3,4,4,4],
'Org2': ['x','x','y','y','z','y','z','z','x','y','y','z','x','x'],
'Org3': ['a','a','b','b','c','b','c','c','a','b','b','c','a','a'],
'Value': [0,0,3,1,0,1,0,5,0,0,0,1,1,1]})
df
For each unique combination of "Org1, Org2, Org3", based on "Value":
1. The first non-zero "Value" should have "FLAG" = 1 and the others 0.
2. If all "Value" entries are 0, then one of the rows should have "FLAG" = 1 and the others 0.
3. If all "Value" entries are non-zero, then the first row should have "FLAG" = 1 and the others 0.
I was using the solution provided in
Flag the first non zero column value with 1 and rest 0 having multiple columns
The one difference is that point 2 above isn't covered there:
"If all "value" are 0 then one of the row's "FLAG" = 1 and others = 0"

You can modify the linked solution by removing .where:
m = df['Value'].ne(0)
idx = m.groupby([df['Org1'],df['Org2'],df['Org3']]).idxmax()
df['FLAG'] = df.index.isin(idx).astype(int)
print (df)
Org1 Org2 Org3 Value FLAG
0 1 x a 0 1
1 1 x a 0 0
2 1 y b 3 1
3 1 y b 1 0
4 2 z c 0 0
5 2 y b 1 1
6 2 z c 0 0
7 2 z c 5 1
8 3 x a 0 1
9 3 y b 0 1
10 3 y b 0 0
11 4 z c 1 1
12 4 x a 1 1
13 4 x a 1 0

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = (df['light'].isin([v]))
    df2 = df[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?

Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
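The crosstab above does not include the absent value 'd'. One possible way to add it, assuming the external list of values is known up front, is to reindex the result against that list with fill_value=0. The name wanted is chosen for illustration.

```python
import pandas as pd

testdf = pd.DataFrame(
    {
        "light": ["b", "b", "c", "a", "a", "a", "a"],
        "injury": [1, 5, 5, 5, 2, 2, 4],
    },
    index=[1, 2, 3, 4, 5, 6, 9],
)

wanted = ["a", "b", "c", "d"]  # external list of values to report on

counts = pd.crosstab(testdf["light"], testdf["injury"])
# reindex adds a row for every value in wanted; rows not present are filled with 0
counts = counts.reindex(wanted, fill_value=0)
print(counts)
```

The same reindex call with columns= instead of the row labels would add missing injury values in the same way.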

Pandas groupby treat nonconsecutive as different variables?

I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([ np.array(['a','b','c','b','a']),
np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
    total = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, total
#idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
# zip returns an iterator in Python 3, so materialize it before building the array
result = list(zip(*df.groupby([df.key]).apply(get_result)))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution

Not optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't group by id alone, so add a new column that marks, within each id, whether the stamps are consecutive:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort is added at the end, as suggested in the comments, to get the exact output (sort_values is used because DataFrame.sort no longer exists in current pandas):
> df.groupby(['id','group_diff']).agg({'increment': 'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
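A more compact alternative is sketched below. It assumes pandas 0.25 or later (for named aggregation), reuses the shift/cumsum key from the question, and aggregates in a single groupby call. The name run is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": list("aaabcbbaaa"),
    "stamp": range(1, 11),
    "increment": [1] * 10,
})

# A new run starts whenever id differs from the previous row
run = df["id"].ne(df["id"].shift()).cumsum().rename("run")

result = (
    df.groupby(run)
      .agg(
          id=("id", "first"),
          stamp=("stamp", "first"),
          increment_sum=("increment", "sum"),
      )
      .reset_index(drop=True)
)
print(result)
```

Grouping on the run key keeps the rows in their original order, so no final sort is needed.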
