I have a csv file (n types of products rated by users):
Simplified illustration of the source table:
--------------------------------
User_id | Product_id | Rating |
--------------------------------
      1 |         00 |      3 |
      1 |         02 |      5 |
      2 |         01 |      1 |
      2 |         00 |      2 |
      2 |         02 |      2 |
I load it into a pandas DataFrame and want to transform it, converting the per-product rating values from rows to columns in the following way:
As a result of the conversion the number of rows stays the same, but there are 6 additional columns:
3 columns (p0rt, p1rt, p2rt), each corresponding to a product type. They need to contain the rating given by the user in this row to that product. Only one of these columns per row can hold a rating; the other two must be zeros/nulls.
3 columns (uspr0rt, uspr1rt, uspr2rt), which need to contain all product ratings given by this user across all of their rows; values in columns for products this user has not rated must be zeros/nulls.
Desired output
------------------------------------------------------------
User_id | p0rt | p1rt | p2rt | uspr0rt | uspr1rt | uspr2rt |
------------------------------------------------------------
      1 |    3 |    0 |    0 |       3 |       0 |       5 |
      1 |    0 |    0 |    5 |       3 |       0 |       5 |
      2 |    0 |    1 |    0 |       2 |       1 |       2 |
      2 |    2 |    0 |    0 |       2 |       1 |       2 |
      2 |    0 |    0 |    2 |       2 |       1 |       2 |
I will greatly appreciate any help with this. The actual number of distinct product_ids/product types is ~60,000 and the file has ~400 million rows, so performance is important.
Update 1
I tried using pivot_table but the dataset is too large for it to work (I wonder if there is a way to do it in batches):
df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
Update 2
I tried "chunking" the data and applied pivot_table to a smaller chunk (240 million rows and "only" 1300 types of products) as a test, but this didn't work either.
My code:
df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)
Outcome:
IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426
This is a known, unresolved pandas issue (IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723).
I think I'll try Dask next; if that does not work, I guess I'll need to write the data reshaper myself in C++ or another lower-level language.
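For the small example above, a minimal sketch of the reshape (assuming integer product ids and the column names from the illustration; with ~60,000 products this dense layout will not fit in memory, so it only shows the logic, not a scalable solution):

import pandas as pd

df = pd.read_csv('product_ratings.csv',
                 dtype={'User_id': 'int32', 'Product_id': 'int16', 'Rating': 'int8'})

# Per-row columns p0rt, p1rt, ...: one-hot encode Product_id and scale by the rating,
# so each row carries its rating only in the column of the product it rates.
per_row = pd.get_dummies(df['Product_id'], prefix='p', prefix_sep='').mul(df['Rating'], axis=0)
per_row.columns = [f'{c}rt' for c in per_row.columns]

# Per-user columns uspr0rt, ...: sum the per-row columns within each user and
# broadcast the result back to every row of that user.
per_user = per_row.groupby(df['User_id']).transform('sum')
per_user.columns = ['uspr' + c[1:] for c in per_user.columns]

out = pd.concat([df[['User_id']], per_row, per_user], axis=1)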
I have three dfs. df1 contains 46 columns; df2 and df3 contain 41 columns each, holding threshold values for columns in df1 that indicate whether a row needs a repeat or a repeat with addition. Below are simplified examples.
df1:
|Name | A | B | C |.......
------------------------------
0|ID1 | 10 | 2 | 3 |
1|ID2 | 400 | 1 | 6 |
2|ID3 | 7 | 8 | 9 |
3|ID4 | 12 | 300 | 55 |
4|ID5 | 0 | 1 | 2 |
df2:
| A | B | C |.......
------------------------------
Repeat | 10 | 2 | 50 |
df3:
| A | B | C |.......
------------------------------
Repeat w Addition| 100 | 200 | 500 |
What I'd like to do is create a new column in df1 with the value "Repeat", "Repeat with Addition" or "No" based on the following conditions:
for each row, if any value in cols A-C is greater than the repeat-with-addition threshold = Repeat with Addition
for each row, if any value in cols A-C is greater than the repeat threshold but less than the repeat-with-addition threshold = Repeat
else No
desired output:
df1:
|Name | A | B | C |.......|Repeat Required?|
--------------------------------------------------
0|ID1 | 10 | 2 | 3 |.......| Repeat
1|ID2 | 400 | 1 | 6 |.......| Repeat with Addition
2|ID3 | 7 | 8 | 9 |.......| Repeat
3|ID4 | 12 | 300 | 55 |.......| Repeat with Addition
4|ID5 | 0 | 1 | 2 |.......| No
What I have so far:
I tried using a function with np.select to fill the column, but it produces a bunch of "No" values where that is not correct.
def repeat_required(df):
    conds = [df >= df3.loc["Repeat w Addition"], df >= df2.loc["Repeat"]]
    labels = ['Repeat with Dilution', 'Repeat']
    return np.select(conds, labels, default='No')

df1["Repeat Required?"] = ""
df1["Repeat Required?"] = repeat_required(df1.iloc[:, 4:-1])  # the first 4 columns contain strings
You're right that you want to use np.select, but the conditions you need to provide are Boolean Series that are the same length as df1. To do this, compare against the rows of df2 and df3 as Series (so that they align on the columns) and then check whether any of the values in each row of df1 satisfy your conditions.
You can specify the columns to compare manually, as I do with a list below, or you can omit the list and rely on pandas aligning the columns automatically for the .ge comparison, so that only the overlapping columns feed into the any check.
import numpy as np
cols = ['A', 'B', 'C']
conds = [df1[cols].ge(df3[cols].loc['Repeat w Addition']).any(axis=1),
         df1[cols].ge(df2[cols].loc['Repeat']).any(axis=1)]
choices = ['Repeat w Addition', 'Repeat']
df1['Repeat Required'] = np.select(conds, choices, default='No')
print(df1)
Name A B C Repeat Required
0 ID1 10 2 3 Repeat
1 ID2 400 1 6 Repeat w Addition
2 ID3 7 8 9 Repeat
3 ID4 12 300 55 Repeat w Addition
4 ID5 0 1 2 No
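For reference, the simplified frames above can be rebuilt like this (a hypothetical reconstruction so the snippet is self-contained):

import pandas as pd

df1 = pd.DataFrame({'Name': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5'],
                    'A': [10, 400, 7, 12, 0],
                    'B': [2, 1, 8, 300, 1],
                    'C': [3, 6, 9, 55, 2]})
df2 = pd.DataFrame({'A': [10], 'B': [2], 'C': [50]}, index=['Repeat'])
df3 = pd.DataFrame({'A': [100], 'B': [200], 'C': [500]}, index=['Repeat w Addition'])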
I have multiple columns in my dataframe, of which I am using 2: "customer_id" and "trip_id". I used the groupby function data.groupby(['customer_id','trip_id']). There are multiple trips taken by each customer. I want to count how many trips each customer took, but when I use an aggregate function along with the groupby I get 1 in all the rows. How should I proceed?
I want something in this format.
Example:
Customer_id | Trip_Id | Count
CustID1     | trip1   | 3
            | trip 2  |
            | trip 3  |
CustID2     | Trip450 | 2
            | Trip23  |
You can group by customer and count the number of unique trips using the built-in nunique:
data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique'))
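If the count should appear next to each (customer, trip) row, as in the layout sketched in the question, transform broadcasts the per-customer count back to every row (column names assumed to match the snippet above):

data['Count'] = data.groupby('Customer_id')['Trip_id'].transform('nunique')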
You can use data.groupby('customer_id')['trip_id'].count()
Example:
import pandas as pd

df1 = pd.DataFrame(columns=["c1", "c1a", "c1b"], data=[["x", 2, 3], ["z", 5, 6], ["z", 8, 9]])
print(df1)
# | c1 | c1a | c1b |
# |----|-----|-----|
# | x | 2 | 3 |
# | z | 5 | 6 |
# | z | 8 | 9 |
df2 = df1.groupby("c1").count()
print(df2)
# | | c1a | c1b |
# |----|-----|-----|
# | x | 1 | 1 |
# | z | 2 | 2 |
Sorry for the dumb question, but I got stuck. I have a dataframe with the following structure:
|.....| ID | Cause | Date |
| 1 | AR | SGNLss| 10-05-2019 05:01:00|
| 2 | TD | PTRXX | 12-05-2019 12:15:00|
| 3 | GZ | FAIL | 10-05-2019 05:01:00|
| 4 | AR | PTRXX | 12-05-2019 12:15:00|
| 5 | GZ | SGNLss| 10-05-2019 05:01:00|
| 6 | AR | FAIL | 10-05-2019 05:01:00|
What I want is to convert the Date column values into columns rounded to the day, so that the resulting DF has ID, 10-05-2019, 11-05-2019, 12-05-2019... columns, and the values are the number of events (Causes) that happened for that ID.
It's not a problem to round to the day and to count values separately, but I can't work out how to do both operations together.
You can use pd.crosstab:
pd.crosstab(df['ID'], df['Date'].dt.date)
Output:
Date 2019-10-05 2019-12-05
ID
AR 2 1
GZ 2 0
TD 0 1
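Note that .dt.date requires Date to be a real datetime column; if it was read in as text, convert it first (the output above parsed the sample dates month-first; pass dayfirst=True if they are actually day-first):

df['Date'] = pd.to_datetime(df['Date'])   # or pd.to_datetime(df['Date'], dayfirst=True)
pd.crosstab(df['ID'], df['Date'].dt.date)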
I have a dataframe in Pandas with collected data;
import pandas as pd
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Subgroup': ['Blue', 'Blue', 'Blue', 'Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})
+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A | Blue | 1 |
| A | Blue | 2 |
| A | Blue | 4 |
| A | Red | 1 |
| A | Red | 2 |
| A | Red | 3 |
| A | Red | 4 |
| B | Blue | 1 |
| B | Blue | 2 |
| B | Blue | 3 |
| B | Blue | 6 |
| B | Red | 1 |
| B | Red | 2 |
| B | Red | 3 |
+-------+----------+-----+
The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:
+-------+--------------------+--------+--------+
| Group | Total Observations | Missed | % |
+-------+--------------------+--------+--------+
| A | 8 | 1 | 12.5% |
| B | 9 | 2 | 22.22% |
+-------+--------------------+--------+--------+
I tried both with for loops and with groupby, for example:
groups = df.groupby(['Group','Subgroup']).sum()
print(groups.head())
but I can't seem to get that to work in any way I try. Am I going about this the wrong way?
From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements, however I don't quite understand how to apply it here yet:
from itertools import chain, islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)
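For orientation, a hypothetical call on the group A / Blue observations from the frame above would return the gap:

blue_a = sorted(df.loc[(df.Group == 'A') & (df.Subgroup == 'Blue'), 'Obs'])
print(missing_elements(blue_a))  # [3]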
Can anyone give me a pointer in the right direction?
Simple enough, you'll need groupby here:
Using groupby + diff, figure out how many observations are missing per Group and Subgroup
Group df on Group, and compute the size and sum of the column computed in the previous step
A couple more straightforward steps (calculating the %) give you your intended output.
f = [  # declare an aggfunc list in advance, we'll need it later
    ('Total Observations', 'size'),
    ('Missed', 'sum')
]

# diff of Obs within each (Group, Subgroup), minus 1, is the number of values skipped
# before each observation; per Group, sum gives the missed count and size the observed rows.
g = df.groupby(['Group', 'Subgroup'])\
      .Obs.diff()\
      .sub(1)\
      .groupby(df.Group)\
      .agg(f)

g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100
g
Total Observations Missed %
Group
A 8.0 1.0 12.500000
B 9.0 2.0 22.222222
A similar approach using groupby, apply and assign:
(
    df.groupby(['Group', 'Subgroup']).Obs
      .apply(lambda x: [x.max() - x.min() + 1, x.max() - x.min() + 1 - len(x)])
      .apply(pd.Series)
      .groupby(level=0).sum()
      .assign(pct=lambda x: x[1] / x[0] * 100)
      .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)
)
Out[75]:
Total Observations Missed %
Group
A 8 1 12.500000
B 9 2 22.222222
from collections import Counter

gs = ['Group', 'Subgroup']
old_tups = set(zip(*df.values.T))          # set of (Group, Subgroup, Obs) tuples actually present

# per Group, count the values in each subgroup's min..max range that never occur
missed = pd.Series(Counter(
    g for (g, s), d in df.groupby(gs)
    for o in range(d.Obs.min(), d.Obs.max() + 1)
    if (g, s, o) not in old_tups
), name='Missed')

hit = df.set_index(gs).Obs.count(level=0)  # observations actually recorded per Group
total = hit.add(missed).rename('Total')
ratio = missed.div(total).rename('%')

pd.concat([total, missed, ratio], axis=1).reset_index()
Group Total Missed %
0 A 8 1 0.125000
1 B 9 2 0.222222
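The % column here is a fraction rather than a percentage; scaling it before the concat is a one-liner (same names as above):

ratio = missed.div(total).mul(100).rename('%')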