Pivot a large amount of per-user data from rows into columns - python

I have a csv file (n types of products rated by users):
Simplified illustration of source table
---------------------------------
User_id | Product_id | Rating |
---------------------------------
1       | 00         | 3      |
1       | 02         | 5      |
2       | 01         | 1      |
2       | 00         | 2      |
2       | 02         | 2      |
---------------------------------
I load it into a pandas dataframe and I want to transform it, converting the per-product rating values from rows to columns in the following way:
As a result of the conversion the number of rows will remain the same, but there will be 6 additional columns:
3 columns (p0rt, p1rt, p2rt), one per product type. Each needs to contain the rating given by the user in this row to that product; only one of the three columns per row can hold a rating and the other two must be zeros/nulls.
3 columns (uspr0rt, uspr1rt, uspr2rt), which need to contain all the product ratings given by this row's user across all of their rows; values in the columns for products this user did not rate must be zeros/nulls.
Desired output
----------------------------------------------------------------
User_id | p0rt | p1rt | p2rt | uspr0rt | uspr1rt | uspr2rt |
----------------------------------------------------------------
1       | 3    | 0    | 0    | 3       | 0       | 5       |
1       | 0    | 0    | 5    | 3       | 0       | 5       |
2       | 0    | 1    | 0    | 2       | 1       | 2       |
2       | 2    | 0    | 0    | 2       | 1       | 2       |
2       | 0    | 0    | 2    | 2       | 1       | 2       |
----------------------------------------------------------------
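For illustration, a minimal pandas sketch that reproduces this output on the toy data above (column and output names follow the tables; this does not address the memory problem described below):

import pandas as pd

# Toy data from the source table (product ids written as plain integers here).
df = pd.DataFrame({'User_id':    [1, 1, 2, 2, 2],
                   'Product_id': [0, 2, 1, 0, 2],
                   'Rating':     [3, 5, 1, 2, 2]})

# p0rt/p1rt/p2rt: the rating of this row's product, zero in the other columns.
p = pd.get_dummies(df['Product_id'], prefix='p', prefix_sep='', dtype=int).mul(df['Rating'], axis=0)
p.columns = [f'{c}rt' for c in p.columns]

# uspr0rt/uspr1rt/uspr2rt: every rating given by this row's user, broadcast to all of their rows.
uspr = (df.pivot_table(index='User_id', columns='Product_id', values='Rating', fill_value=0)
          .add_prefix('uspr').add_suffix('rt'))

out = pd.concat([df[['User_id']], p], axis=1).join(uspr, on='User_id')
print(out)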
I will greatly appreciate any help with this. The actual number of distinct product_ids/product types is ~60,000 and the number of rows in the file is ~400 million, so performance is important.
Update 1
I tried using pivot_table, but the dataset is too large for it to work (I wonder if there is a way to do it in batches):
df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
Update 2
I tried "chunking" the data and applied pivot_table to a smaller chunk (240mln rows and "only" 1300 types of products) as a test, but this didn't work either:
My code:
df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)
Outcome:
IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426
This is a known, unresolved Pandas issue (IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723).
I think I'll try Dask next; if that does not work, I guess I'll need to write the data reshaper myself in C++ or another lower-level language.
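For reference, a rough sketch of how the Dask attempt might look (untested at this scale; dask.dataframe.pivot_table takes a single index column, requires the columns column to be a known categorical, and yields one row per user rather than one per rating, so it only covers the uspr-style columns; column names follow Update 2):

import dask.dataframe as dd

ddf = dd.read_csv('product_ratings.csv',
                  dtype={'user_id': 'int32', 'product_id': 'int16', 'rating': 'int8'})
# pivot_table in Dask needs the `columns` column to be categorical with known categories.
ddf['product_id'] = ddf['product_id'].astype('category').cat.as_known()
user_ratings = dd.pivot_table(ddf, index='user_id', columns='product_id',
                              values='rating', aggfunc='sum')

With ~60,000 product types this still produces a very wide result, so it may only be workable on a subset of products.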

Related

How to generate new variable (column) based on a specific given column containing cycles of numbers from 0 to n, where n is a positive integer

The dataset contains data about COVID-19 patients. It is in both Excel and CSV file formats, and contains several variables and over 7 thousand records (rows), which makes the problem extremely hard and very time-consuming to solve manually. The 4 most important variables (columns) needed to solve the problem are:
1: id, identifying each record (row)
2: day_at_hosp, one entry for each day a patient remained admitted at the hospital
3: sex of the patient
4: death, whether the patient eventually died or survived
I want to create a new variable total_days_at_hosp which should contain a total of days a patient remained admitted at hospital.
Old Table:
_______________________________________
| id | day_at_hosp | sex | death |
|_______|_____________|________|________|
| 1 | 0 | male | no |
| 2 | 1 | | |
| 3 | 2 | | |
| 4 | 0 | female | no |
| 5 | 1 | | |
| 6 | 0 | male | no |
| 7 | 0 | female | no |
| 8 | 0 | male | no |
| 9 | 1 | | |
| 10 | 2 | | |
| 11 | 3 | | |
| 12 | 4 | | |
| ... | ... | ... | ... |
| 7882 | 0 | female | no |
| 7883 | 1 | | |
|_______|_____________|________|________|
New Table:
I want to convert table above into table below:
____________________________________________
| id |total_days_at_hosp| sex | death |
|_______|__________________|________|________|
| 1 | 3 | male | no |
| 4 | 2 | male | yes |
| 6 | 1 | male | yes |
| 7 | 1 | female | no |
| 8 | 5 | male | no |
| ... | ... | ... | ... |
| 2565 | 2 | female | no |
|_______|__________________|________|________|
NOTE: the id column is for every record entered, and multiple records were entered for each patient depending on how long the patient remained admitted at the hospital. The day_at_hosp variable contains days: 0 = initial day at hospital, 1 = second day at hospital, ..., n = last day at hospital.
The record (row) where day_at_hosp is 0 has entries in all the other columns; if day_at_hosp is not 0 (say 1, 2, 3, ..., 5), then the row belongs to the patient right above it, and the corresponding other variables (columns) are left blank.
However, the dataset I need should look like the New Table above.
It should include a new variable (column) called total_days_at_hosp, generated from the variable (column) day_at_hosp. The new variable total_days_at_hosp is more useful in the statistical tests to be conducted and will replace day_at_hosp, so that all blank rows can be deleted.
To move from old table to new table the needed program should do the following:
day_at_hosp ===> total_days_at_hosp
0
1 ---> 3
2
-------------------------------------
0 ---> 2
1
-------------------------------------
0 ---> 1
-------------------------------------
0 ---> 1
-------------------------------------
0
1
2 ---> 5
3
4
-------------------------------------
...
-------------------------------------
0 ---> 2
1
-------------------------------------
How can I achieve this?
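For reference, a minimal pandas sketch of one way to do this (the file name is hypothetical and the column names are taken from the old table; this is an assumption, not part of the Excel answers below):

import pandas as pd

df = pd.read_csv('covid_records.csv')   # hypothetical file name

# A row with day_at_hosp == 0 starts a new admission; the rows after it belong
# to the same patient, so forward-fill that row's id as an admission key.
df['admission'] = df['id'].where(df['day_at_hosp'] == 0).ffill()

new_table = (df.groupby('admission', as_index=False)
               .agg(id=('id', 'first'),
                    total_days_at_hosp=('day_at_hosp', 'max'),
                    sex=('sex', 'first'),
                    death=('death', 'first')))
new_table['total_days_at_hosp'] += 1    # day 0 counts as the first day
new_table = new_table.drop(columns='admission')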
Another formula option, without a dummy value placed at the end of the Old/New Table.
1] Create the New Table by:
Copy and paste all Old Table data to an unused area
Click "AutoFilter"
In the "days_at_hospital" column, select the =0 value
Copy and paste the filtered admissions to New Table column F
Delete all 0s in the rows of column G
Then,
2] In G2, formula copied down :
=IF(F2="","",IF(F3="",MATCH(9^9,A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
Remark: if your "ID column" contains text values, the formula changes to:
=IF(F2="","",IF(F3="",MATCH("zzz",A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
It is apparent that your data are sorted by patient, and that your desired table will be much 'shorter'. Accordingly, the starting point for this answer is to apply an AutoFilter to your original data, setting the filter criterion to days_at_hospital = 0, and then copy this filtered list of admissions to column F.
After deleting the old column G data, the formula below can then be entered in cell G2 and copied down:
=INDEX(B:B,MATCH(F3,A:A,0)-1)+1
To keep the formula simple, the same dummy maximum value should be entered at the end of both the old and new tables.

How to count the number of items in a group after using Groupby in Pandas

I have multiple columns in my dataframe, of which I am using 2: "customer_id" and "trip_id". I used the groupby function data.groupby(['customer_id','trip_id']). There are multiple trips taken by each customer. I want to count how many trips each customer took, but when I use an aggregate function along with groupby I get 1 in all the rows. How should I proceed?
I want something in this format.
Example :
Customer_id , Trip_Id, Count
CustID1 ,trip1, 3
trip 2
trip 3
CustID2 ,Trip450, 2
Trip23
You can group by customer and count the number of unique trips using the built in nunique:
data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique'))
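For example, on made-up data mirroring the question's layout (the frame below is an assumption, not the asker's actual data), this gives one row per customer with the trip count:

import pandas as pd

data = pd.DataFrame({'Customer_id': ['CustID1', 'CustID1', 'CustID1', 'CustID2', 'CustID2'],
                     'Trip_id':     ['trip1', 'trip2', 'trip3', 'Trip450', 'Trip23']})
print(data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique')))
#              Count
# Customer_id
# CustID1          3
# CustID2          2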
You can use data.groupby(['customer_id','trip_id']).count()
Example:
df1 = pd.DataFrame(columns=["c1","c1a","c1b"], data=[["x",2,3],["z",5,6],["z",8,9]])
print(df1)
# | c1 | c1a | c1b |
# |----|-----|-----|
# | x  | 2   | 3   |
# | z  | 5   | 6   |
# | z  | 8   | 9   |
df2 = df1.groupby("c1").count()
print(df2)
# |    | c1a | c1b |
# |----|-----|-----|
# | x  | 1   | 1   |
# | z  | 2   | 2   |

Transforming data frame (row to column and count)

Sorry for the dumb question, but I got stuck. I have a dataframe with the following structure:
|.....| ID | Cause | Date |
| 1 | AR | SGNLss| 10-05-2019 05:01:00|
| 2 | TD | PTRXX | 12-05-2019 12:15:00|
| 3 | GZ | FAIL | 10-05-2019 05:01:00|
| 4 | AR | PTRXX | 12-05-2019 12:15:00|
| 5 | GZ | SGNLss| 10-05-2019 05:01:00|
| 6 | AR | FAIL | 10-05-2019 05:01:00|
What I want is to convert the Date column values into columns, rounded to the day, so that the expected DF will have ID, 10-05-2019, 11-05-2019, 12-05-2019, ... columns, and the values will be the number of events (Causes) that happened for this ID.
It's not a problem to round to the day and to count values separately, but I can't work out how to do both operations together.
You can use pd.crosstab:
pd.crosstab(df['ID'], df['Date'].dt.date)
Output:
Date  2019-10-05  2019-12-05
ID
AR             2           1
GZ             2           0
TD             0           1
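Note that df['Date'].dt.date assumes the Date column has already been parsed as datetimes; if it is still stored as strings in the format shown in the question, it would first need converting, for example (the dayfirst flag is an assumption about that format):

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)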

How to compare two pandas dataframes and remove duplicates on one file without appending data from other file [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 4 years ago.
I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.
I am creating two pandas dataframes from these files:
import pandas as pd
dfmaster = pd.read_csv('test_master.csv')
dfdaily = pd.read_csv('test_daily.csv')
I want the daily list to be compared to the master list to see if there are any duplicate rows on the daily list that are already in the master list. If so, I want to remove those duplicates from dfdaily. I then want to write this non-duplicate data to dfmaster.
The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison.
I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
Here is an example of what the dataframes could look like.
test_master.csv
+-------------+-------------+-------------+
| column 1    | column 2    | column 3    |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+
test_daily.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
Desired output is:
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
test_daily.csv
+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+----------+----------+----------+
Any help would be greatly appreciated!
EDIT
I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index labels, as mentioned in a comment by Troy D below. Troy D's solution is the one I am now using.
Try this:
I create two example DataFrames, and then set rows 2-4 of the daily one to be duplicates of rows from the master:
import numpy as np
import pandas as pd

test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values
print(test_master)
print(test_daily)
output:
A B C
0 0.009322 0.330057 0.082956
1 0.197500 0.010593 0.356774
2 0.147410 0.697779 0.421207
A B C
0 0.643062 0.335643 0.215443
1 0.009322 0.330057 0.082956
2 0.197500 0.010593 0.356774
3 0.147410 0.697779 0.421207
4 0.973867 0.873358 0.502973
Then, add a multiindex level to identify which data is from which dataframe:
test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)
Now merge as you suggested and drop duplicates:
merged = pd.concat([test_master, test_daily])  # DataFrame.append was removed in pandas 2.0
merged = merged.drop_duplicates().sort_index()
print(merged)
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
master 0.009322 0.330057 0.082956
1 master 0.197500 0.010593 0.356774
2 master 0.147410 0.697779 0.421207
4 daily 0.973867 0.873358 0.502973
There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:
idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
4 daily 0.973867 0.873358 0.502973
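To get the de-duplicated daily data back into a file, as the question asks, one option (sketched here; the output file name is illustrative) is to drop the helper index level before writing:

daily_only = merged.loc[idx[:, 'daily'], :].droplevel(1)
daily_only.to_csv('test_daily_deduped.csv')  # illustrative file name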

How do you convert two columns of vectors into one PySpark data frame?

After having run PolynomialExpansion on a Pyspark dataframe, I have a data frame (polyDF) that looks like this:
+--------------------+--------------------+
| features| polyFeatures|
+--------------------+--------------------+
|(81,[2,9,26],[13....|(3402,[5,8,54,57,...|
|(81,[4,16,20,27,3...|(3402,[14,19,152,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
The "features" column includes the features included in the original data. Each row represents a different user. There are 81 total possible features for each user in the original data. The "polyFeatures" column includes the features after the polynomial expansion has been run. There are 3402 possible polyFeatures after running PolynomialExpansion. So what each row of both columns contain are:
An integer representing the number of possible features (each user may or may not have had a value in each of the features).
A list of integers that contains the feature indexes for which that user had a value.
A list of numbers that contains the values of each of the features mentioned in #2 above.
My question is, how can I take these two columns, create two sparse matrices, and subsequently join them together to get one, full, sparse Pyspark matrix? Ideally it would look like this:
+---+----+----+----+------+----+----+----+----+---+---
| 1 | 2 | 3 | 4 | ... |405 |406 |407 |408 |409|...
+---+----+----+----+------+----+----+----+----+---+---
| 0 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 6 | 0 |...
| 0 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
I have reviewed the Spark documentation for PolynomialExpansion, but it doesn't cover this particular issue. I have also tried to apply the SparseVector class, but it seems to be useful for only one vector at a time rather than a data frame of vectors.
Is there an effective way to accomplish this?
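One possible direction, sketched here under the assumption of Spark 3.x (where pyspark.ml.functions.vector_to_array is available; the vector sizes are taken from the question and the output column names are illustrative), is to turn each vector column into a plain array and then select one output column per array position:

from pyspark.ml.functions import vector_to_array

# Convert the sparse vector columns to dense arrays.
arrays = polyDF.select(vector_to_array('features').alias('f'),
                       vector_to_array('polyFeatures').alias('pf'))

# One column per array position; 81 and 3402 are the vector sizes from the question.
n_features, n_poly = 81, 3402
wide = arrays.select(
    *[arrays['f'][i].alias(f'feature_{i}') for i in range(n_features)],
    *[arrays['pf'][i].alias(f'poly_{i}') for i in range(n_poly)],
)

Note that this materializes all the zeros, so with 3,483 columns the result can be much larger than the sparse representation.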
