Merging pandas column from one dataframe to another based on their indices - python

I have a dataframe, df_one, that looks like this, where video_id is the index:
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched | avg_time_watched |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
| video_id | | | | | | |
| 5 | 17 | 12.000000 | 17 | 1 | 1 | 1.000000 |
| 10 | 22 | 10.000000 | 1 | 1 | 1 | 0.045455 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 | 1.000000 |
| 22 | 29 | 20.000000 | 5 | 1 | 1 | 0.172414 |
+----------+--------------+---------------+--------------+----------------+---------------+------------------+
And I have another dataframe, df_two, that looks like this, where video_id is also the index:
+----------+--------------+---------------+--------------+----------------+------------------------+
| | video_length | feed_position | time_watched | unique_watched | count_watched_yeterday |
+----------+--------------+---------------+--------------+----------------+------------------------+
| video_id | | | | | |
| 5 | 102 | 11.333333 | 73 | 6 | 6 |
| 15 | 22 | 13.000000 | 22 | 1 | 1 |
| 16 | 44 | 2.000000 | 15 | 1 | 1 |
| 17 | 180 | 23.333333 | 53 | 6 | 6 |
| 18 | 40 | 1.000000 | 40 | 1 | 1 |
+----------+--------------+---------------+--------------+----------------+------------------------+
What I want to do is merge the count_watched_yeterday column from df_two into df_one based on the index of each.
I tried:
video_base = pd.merge(df_one, df_two['count_watched_yeterday'], how='left', on=[df_one.index, df_two.index])
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Actually I think the easiest thing to do here is to directly assign:
In [13]:
df_one['count_watched_yesterday'] = df_two['count_watched_yeterday']
df_one['count_watched_yesterday']
Out[13]:
video_id
5 6
10 NaN
15 1
22 NaN
Name: count_watched_yesterday, dtype: float64
This works because assignment aligns on the index values; wherever there is no matching index value, NaN is assigned.
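If you do want an explicit merge, join on the two indices rather than passing index objects to on; a minimal sketch of both spellings (count_watched_yeterday is the column's actual name in df_two):
# join aligns df_two's column on df_one's index; 'left' keeps every df_one row
video_base = df_one.join(df_two['count_watched_yeterday'], how='left')
# the merge equivalent: align on both indices explicitly
video_base = pd.merge(df_one, df_two[['count_watched_yeterday']],
                      how='left', left_index=True, right_index=True)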

Related

cumsum bounded within a range (python, pandas)

I have a df where I'd like the cumsum to be bounded within a range of 0 to 6, where a sum over 6 rolls over to 0. The adj_cumsum column is what I'm trying to get. I've searched and found a couple of posts using loops; however, since mine is more straightforward, I'm wondering whether there is a less complicated or more up-to-date approach.
+----+-------+------+----------+----------------+--------+------------+
| | month | days | adj_days | adj_days_shift | cumsum | adj_cumsum |
+----+-------+------+----------+----------------+--------+------------+
| 0 | jan | 31 | 3 | 0 | 0 | 0 |
| 1 | feb | 28 | 0 | 3 | 3 | 3 |
| 2 | mar | 31 | 3 | 0 | 3 | 3 |
| 3 | apr | 30 | 2 | 3 | 6 | 6 |
| 4 | may | 31 | 3 | 2 | 8 | 1 |
| 5 | jun | 30 | 2 | 3 | 11 | 4 |
| 6 | jul | 31 | 3 | 2 | 13 | 6 |
| 7 | aug | 31 | 3 | 3 | 16 | 2 |
| 8 | sep | 30 | 2 | 3 | 19 | 5 |
| 9 | oct | 31 | 3 | 2 | 21 | 0 |
| 10 | nov | 30 | 2 | 3 | 24 | 3 |
| 11 | dec | 31 | 3 | 2 | 26 | 5 |
+----+-------+------+----------+----------------+--------+------------+
import pandas as pd

data = {"month": ['jan','feb','mar','apr',
                  'may','jun','jul','aug',
                  'sep','oct','nov','dec'],
        "days": [31,28,31,30,31,30,31,31,30,31,30,31]}
df = pd.DataFrame(data)
df['adj_days'] = df['days'] - 28
df['adj_days_shift'] = df['adj_days'].shift(1)
df['cumsum'] = df.adj_days_shift.cumsum()
df.fillna(0, inplace=True)
Kindly advise
What you are looking for is called a modulo operation.
Use df['adj_cumsum'] = df['cumsum'].mod(7).
Intuition:
df["adj_cumsum"] = df["cumsum"].apply(lambda x: x % 7)
Am I right?
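Putting it together on the frame built in the question, the modulo column should reproduce adj_cumsum from the table above:
df['adj_cumsum'] = df['cumsum'].mod(7)  # wraps into 0..6, e.g. 8 -> 1, 21 -> 0
print(df[['month', 'days', 'cumsum', 'adj_cumsum']])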

determine chain of predecessors and successors from a list of first predecessors in python

I have a list like the following
+----+-------------------+
| id | first_predecessor |
+----+-------------------+
| 0 | 4 |
| 1 | 5 |
| 2 | 6 |
| 3 | 17,18 |
| 4 | 7 |
| 5 | 8 |
| 6 | 9 |
| 7 | 10,11,12 |
| 8 | 13,14,15 |
| 9 | 16 |
| 10 | Input |
| 11 | Input |
| 12 | Input |
| 13 | Input |
| 14 | Input |
| 15 | Input |
| 16 | Input |
| 17 | 19 |
| 18 | 20 |
| 19 | 21 |
| 20 | 22 |
| 21 | Input |
+----+-------------------+
One item can have multiple immediate incoming ids, as in the case of id=3, which is immediately preceded by id=17 and id=18.
I need python code to determine this result by following the chain of predecessors both ways (it is best to read the all_successors column to understand the logic; all_predecessors is the same logic backwards):
+----+-------------------+------------------+----------------+
| id | first_predecessor | all_predecessors | all_successors |
+----+-------------------+------------------+----------------+
| 0 | 4 | 4,7,10,11,12 | |
| 1 | 5 | 5,8,13,14,15 | |
| 2 | 6 | 6,9,16 | |
| 3 | 17,18 | 19,21,20,22 | |
| 4 | 7 | 7,10,11,12 | 0 |
| 5 | 8 | 8,13,14,15 | 1 |
| 6 | 9 | 9,16 | 2 |
| 7 | 10,11,12 | 10,11,12 | 0,4 |
| 8 | 13,14,15 | 13,14,15 | 1,5 |
| 9 | 16 | 16 | 2,6 |
| 10 | Input | | 0,4,7 |
| 11 | Input | | 0,4,7 |
| 12 | Input | | 0,4,7 |
| 13 | Input | | 1,5,8 |
| 14 | Input | | 1,5,8 |
| 15 | Input | | 1,5,8 |
| 16 | Input | | 2,6,9 |
| 17 | 19 | 19,21 | 3 |
| 18 | 20 | 20,22 | 3 |
| 19 | 21 | 21 | 3,17 |
| 20 | 22 | 22 | 3,18 |
| 21 | Input | | 3,17,19 |
| 22 | Input | | 3,18,20 |
+----+-------------------+------------------+----------------+
I need some kind of recursive solution, or should I use some graph package?
You can use the following NetworkX functions to find all predecessors and all successors:
ancestors(G, source): returns all nodes having a path to source in G.
descendants(G, source): returns all nodes reachable from source in G.
To run the following example, make sure you change 'Input' in your first_predecessor column to NaN.
import networkx as nx

df_ = df.copy()
# One edge per (predecessor, id) pair: split the comma-separated lists
# and explode them into separate rows.
df_['first_predecessor'] = df_['first_predecessor'].str.split(',')
df_ = df_.explode('first_predecessor')
# NaN (former 'Input') rows get a -1 sentinel so the edgelist stays integer.
df_['first_predecessor'] = df_['first_predecessor'].fillna(-1).astype(int)
G = nx.from_pandas_edgelist(df_, 'first_predecessor', 'id', create_using=nx.DiGraph())
G.remove_node(-1)  # drop the sentinel so 'Input' nodes have no predecessors
df['all_predecessors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.ancestors(G, x)))))
df['all_successors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.descendants(G, x)))))
print(df)
print(df)
id first_predecessor all_predecessors all_successors
0 0 4 4,7,10,11,12
1 1 5 5,8,13,14,15
2 2 6 6,9,16
3 3 17,18 17,18,19,20,21,22
4 4 7 7,10,11,12 0
5 5 8 8,13,14,15 1
6 6 9 9,16 2
7 7 10,11,12 10,11,12 0,4
8 8 13,14,15 13,14,15 1,5
9 9 16 16 2,6
10 10 NaN 0,4,7
11 11 NaN 0,4,7
12 12 NaN 0,4,7
13 13 NaN 1,5,8
14 14 NaN 1,5,8
15 15 NaN 1,5,8
16 16 NaN 2,6,9
17 17 19 19,21 3
18 18 20 20,22 3
19 19 21 21 3,17
20 20 22 22 3,18
21 21 NaN 3,17,19
22 22 NaN 3,18,20
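If you'd rather not pull in a graph package, the same walk can be done recursively over plain dicts; a minimal sketch, assuming first_predecessor still holds the raw strings ('Input' or comma-separated ids) and id holds ints:
# Map each id to its immediate predecessors (empty list for 'Input' rows).
preds = {i: [] if p == 'Input' else [int(x) for x in p.split(',')]
         for i, p in zip(df['id'], df['first_predecessor'])}

# Invert the mapping to get immediate successors.
succs = {}
for i, ps in preds.items():
    for p in ps:
        succs.setdefault(p, []).append(i)

def reachable(node, edges, acc=None):
    # Depth-first walk collecting everything reachable via `edges`.
    if acc is None:
        acc = set()
    for nxt in edges.get(node, []):
        if nxt not in acc:
            acc.add(nxt)
            reachable(nxt, edges, acc)
    return acc

df['all_predecessors'] = df['id'].apply(
    lambda i: ','.join(map(str, sorted(reachable(i, preds)))))
df['all_successors'] = df['id'].apply(
    lambda i: ','.join(map(str, sorted(reachable(i, succs)))))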

How do I sum columns on the basis of a condition using pandas?

The df has the following columns,
col1 | col2 | col3 | Jan-19 | Feb-19 | Mar-19 | Apr-19 | May-19 | Jun-19 | Jul-19 | Aug-19 | Sep-19 | Oct-19 | Nov-19 | Dec-19 | Jan-20 | Feb-20 | Mar-20 | Apr-20 | May-20 | Jun-20 | Jul-20 | Aug-20 | Sep-20 | Oct-20 | Nov-20 | Dec-20
ab | cd | | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98
The month columns contain numbers. I want to sum the month columns on the following condition:
If the datetime.now().strftime('%b-%Y') is anything from Jun-19 (for example) to Oct-19, then I want to sum the month columns from Oct-19 to Feb-20. If it is anything from Jun-20 to Oct-20, then sum the columns from Oct-20 to Feb-21, and so on.
If the datetime.now().strftime('%b-%Y') is anything from Nov-19 to May-20, then I want to sum the month columns from Mar-20 to Sep-20. If it is anything from Nov-20 to May-21, then sum the columns from Mar-21 to Sep-21, and so on.
There should be a Total column at the end.
col1 | col2 | col3 | Jan-19 | Feb-19 | Mar-19 | Apr-19 | May-19 | Jun-19 | Jul-19 | Aug-19 | Sep-19 | Oct-19 | Nov-19 | Dec-19 | Jan-20 | Feb-20 | Mar-20 | Apr-20 | May-20 | Jun-20 | Jul-20 | Aug-20 | Sep-20 | Oct-20 | Nov-20 | Dec-20 | Total
ab | cd | | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 10 | 12 | 14 | 15 | 16 | 12 | 13 | 7 | 82 | 76 | 100 | 98 | 296
Is there a way to create a generic condition for this so that it may work for x month and y year?
It is still not entirely clear what you actually want to do.
But for your case, my suggestion is to select the columns by name and transpose the table; then you can sum the values along the row axis, which is not very costly on a DataFrame.
In my opinion, operating across the column axis of a DataFrame is always harder than across the row axis: for row operations one can use .query() to easily filter the entries they want, but there is no such shortcut in the column direction.
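A minimal sketch of that idea, hard-coding the Oct-19 through Feb-20 window (the one the first rule selects); deriving the window from datetime.now() is left out:
import pandas as pd

# One-row frame with just the months that matter here; values taken
# from the example row above.
df = pd.DataFrame({'col1': ['ab'], 'col2': ['cd'],
                   'Oct-19': [76], 'Nov-19': [100], 'Dec-19': [98],
                   'Jan-20': [10], 'Feb-20': [12]})

window = ['Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20']
# Transpose so the months become rows, then sum down each column
# (equivalent to df[window].sum(axis=1)).
df['Total'] = df[window].T.sum()
# Total == 296, matching the expected output above.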

Pandas can't print the list of objects collected from the web using xpath in Jupyter notebook

This is the code I used. I am using the Jupyter Notebook web version. I upgraded lxml, and my Python version is 3.8.
import numpy as np
import requests
from lxml import html
import csv
import pandas as pd
# getting the web content
r = requests.get('http://www.pro-football-reference.com/years/2017/draft.htm')
data = html.fromstring(r.text)
# collecting specific data
pick = data.xpath('//td[@data-stat="draft_pick"]//text()')
player = data.xpath('//td[@data-stat="player"]//text()')
position = data.xpath('//td[@data-stat="pos"]//text()')
age = data.xpath('//td[@data-stat="age"]//text()')
games_played = data.xpath('//td[@data-stat="g"]//text()')
cmp = data.xpath('//td[@data-stat="pass_cmp"]//text()')
att = data.xpath('//td[@data-stat="pass_att"]//text()')
college = data.xpath('//td[@data-stat="college_id"]//text()')
data = list(zip(pick,player,position,age,games_played,cmp,att,college))
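# NOTE: this rebinds `data` (until now the parsed HTML element) to a plain
# list, so re-running the xpath lines above afterwards raises:
# AttributeError: 'list' object has no attribute 'xpath'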
df = pd.DataFrame(data)
df
These are the two errors showing in the two separate files I tried:
<module 'pandas' from 'C:\Users\anaconda3\lib\site-packages\pandas\__init__.py'>
AttributeError: 'list' object has no attribute 'xpath'
The code is not giving me the list of data I wanted from the webpage. Can anyone help me out with this? Thank you in advance.
You can load html tables directly into a dataframe using read_html:
import pandas as pd
df = pd.read_html('http://www.pro-football-reference.com/years/2017/draft.htm')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rnd'].ne('Rnd')] # remove mid-table header rows
Output:
| | Rnd | Pick | Tm | Player | Pos | Age | To | AP1 | PB | St | CarAV | DrAV | G | Cmp | Att | Yds | TD | Int | Att | Yds | TD | Rec | Yds | TD | Solo | Int | Sk | College/Univ | Unnamed: 28_level_1 |
|---:|------:|-------:|:-----|:------------------|:------|------:|-----:|------:|-----:|-----:|--------:|-------:|----:|------:|------:|------:|-----:|------:|------:|------:|-----:|------:|------:|-----:|-------:|------:|------:|:---------------|:----------------------|
| 0 | 1 | 1 | CLE | Myles Garrett | DE | 21 | 2020 | 1 | 2 | 4 | 35 | 35 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 107 | nan | 42.5 | Texas A&M | College Stats |
| 1 | 1 | 2 | CHI | Mitchell Trubisky | QB | 23 | 2020 | 0 | 1 | 3 | 33 | 33 | 51 | 1010 | 1577 | 10609 | 64 | 37 | 190 | 1057 | 8 | 0 | 0 | 0 | nan | nan | nan | North Carolina | College Stats |
| 2 | 1 | 3 | SFO | Solomon Thomas | DE | 22 | 2020 | 0 | 0 | 2 | 15 | 15 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 73 | nan | 6 | Stanford | College Stats |
| 3 | 1 | 4 | JAX | Leonard Fournette | RB | 22 | 2020 | 0 | 0 | 3 | 25 | 20 | 49 | 0 | 0 | 0 | 0 | 0 | 763 | 2998 | 23 | 170 | 1242 | 2 | nan | nan | nan | LSU | College Stats |
| 4 | 1 | 5 | TEN | Corey Davis | WR | 22 | 2020 | 0 | 0 | 4 | 25 | 25 | 56 | 0 | 0 | 0 | 0 | 0 | 6 | 55 | 0 | 207 | 2851 | 11 | nan | nan | nan | West. Michigan | College Stats |
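If you still only want the fields from the original attempt, you can subset the result. Note in the output above that some labels (Att, Yds, TD) repeat across passing/rushing/receiving stats, so selecting one of those names returns every matching column; the labels below are unique:
# Column labels taken from the printed output above.
subset = df[['Pick', 'Player', 'Pos', 'Age', 'G', 'College/Univ']]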

Filter all rows from groupby object

I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 2 | 77 | 105 | 3 | 12 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 1 | 21 | 145 | 1 | 9 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
After grouping by 'InvoiceNo' and 'CategoryNo', I want to keep an entire group if any of the items in the list item_list = [128,129,130] is present in that group.
My desired output is below:
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a dataframe using isin(), but I'm not sure how to do it with groupby().
So far I have tried the following:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128,129,130]
df.groupby(['InvoiceNo','CategoryNo'])['Item'].isin(item_list)
but nothing happens. Please guide me on how to solve this issue.
You can do something like this:
# Flag rows whose Item is in item_list, then broadcast "any match in the
# group" back onto every row of each (InvoiceNo, CategoryNo) group.
s = (df['Item'].isin(item_list)
         .groupby([df['InvoiceNo'], df['CategoryNo']])
         .transform('any'))
df[s]
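An equivalent route is groupby().filter(), which keeps whole groups whose rows satisfy a predicate; it reads more directly, though it can be slower than the transform approach when there are many groups:
# Keep every (InvoiceNo, CategoryNo) group containing at least one listed Item.
out = (df.groupby(['InvoiceNo', 'CategoryNo'])
         .filter(lambda g: g['Item'].isin(item_list).any()))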
