Transforming data frame (row to column and count) - python

Sorry for the dumb question, but I got stuck. I have a dataframe with the following structure:
|.....| ID | Cause | Date |
| 1 | AR | SGNLss| 10-05-2019 05:01:00|
| 2 | TD | PTRXX | 12-05-2019 12:15:00|
| 3 | GZ | FAIL | 10-05-2019 05:01:00|
| 4 | AR | PTRXX | 12-05-2019 12:15:00|
| 5 | GZ | SGNLss| 10-05-2019 05:01:00|
| 6 | AR | FAIL | 10-05-2019 05:01:00|
What I want is to convert the Date column values into columns rounded to the day, so that the expected DF has the columns ID, 10-05-2019, 11-05-2019, 12-05-2019, ... and the values are the number of events (Causes) that happened for that ID.
It's not a problem to round to the day and count values separately, but I can't figure out how to do both operations together.

You can use pd.crosstab:
pd.crosstab(df['ID'], df['Date'].dt.date)
Output:
Date 2019-10-05 2019-12-05
ID
AR 2 1
GZ 2 0
TD 0 1
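A minimal, self-contained sketch of the same idea follows. It assumes the dates in the question are day-first (10-05-2019 meaning 10 May 2019), which the question does not state explicitly, so the format string is a guess. It also reindexes the columns over the full date range so that days without events (such as 11-05-2019) appear as zero columns. The names counts and full are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": ["AR", "TD", "GZ", "AR", "GZ", "AR"],
    "Cause": ["SGNLss", "PTRXX", "FAIL", "PTRXX", "SGNLss", "FAIL"],
    "Date": ["10-05-2019 05:01:00", "12-05-2019 12:15:00", "10-05-2019 05:01:00",
             "12-05-2019 12:15:00", "10-05-2019 05:01:00", "10-05-2019 05:01:00"],
})

# Assumption: the strings are day-month-year
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y %H:%M:%S")

# Round to the day (midnight timestamps) and count rows per ID and day
counts = pd.crosstab(df["ID"], df["Date"].dt.normalize())

# Optional: add zero-filled columns for days with no events at all
all_days = pd.date_range(counts.columns.min(), counts.columns.max(), freq="D")
full = counts.reindex(columns=all_days, fill_value=0)

print(full)
```

If the dates are actually month-first, only the format string needs to change.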

Related

Pivot large amount of per user data from rows into columns

I have a csv file (n types of products rated by users):
Simplified illustration of source table
--------------------------------
User_id | Product_id | Rating |
--------------------------------
1 | 00 | 3 |
1 | 02 | 5 |
2 | 01 | 1 |
2 | 00 | 2 |
2 | 02 | 2 |
I load it into a pandas dataframe and I want to transform it, converting the rating values from rows to columns in the following way:
as a result of the conversion the number of rows will remain the same, but there will be 6 additional columns
3 columns (p0rt, p1rt, p2rt), each corresponding to a product type. They need to contain the rating given by the user in this row to a product. Just one of the columns per row can have a rating and the other two must be zeros/nulls
3 columns (uspr0rt, uspr1rt, uspr2rt) that need to contain all product ratings provided by the user in this row; values in columns related to products unrated by this user must be zeros/nulls
Desired output
------------------------------------------------------
User_id |p0rt |p1rt |p2rt |uspr0rt |uspr1rt |uspr2rt |
------------------------------------------------------
1 | 3 | 0 | 0 | 3 | 0 | 5 |
1 | 0 | 0 | 5 | 3 | 0 | 5 |
2 | 0 | 1 | 0 | 2 | 1 | 2 |
2 | 2 | 0 | 0 | 2 | 1 | 2 |
2 | 0 | 0 | 2 | 2 | 1 | 2 |
I will greatly appreciate any help with this. The actual number of distinct product_ids/product types is ~60,000 and the number of rows in the file is ~400mln, so performance is important.
Update 1
I tried using pivot_table but the dataset is too large for it to work (I wonder if there is a way to do it in batches)
df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
Update 2
I tried "chunking" the data and applied pivot_table to a smaller chunk (240mln rows and "only" 1300 types of products) as a test, but this didn't work either:
My code:
df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)
Outcome:
IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426
This is a known, unresolved pandas issue: "IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723"
I think I'll try Dask next. If this does not work, I guess I'll need to write the data reshaper myself in C++ or another lower-level language
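To make the desired transformation concrete, here is a small sketch that reproduces the "Desired output" table on the five sample rows. It is dense and purely illustrative: with ~60,000 products and ~400mln rows this approach would hit the same memory wall described above, so it is not a solution to the scaling problem. The variable names (one_hot, per_row, per_user) are made up for the example, and it assumes each user rates a product at most once.

```python
import pandas as pd

# Sample rows from the question (product ids written as integers)
df = pd.DataFrame({
    "User_id": [1, 1, 2, 2, 2],
    "Product_id": [0, 2, 1, 0, 2],
    "Rating": [3, 5, 1, 2, 2],
})
products = [0, 1, 2]

# One indicator column per product, then scale each row by its rating
one_hot = (pd.get_dummies(df["Product_id"])
             .reindex(columns=products, fill_value=0)
             .astype(int))
per_row = one_hot.mul(df["Rating"], axis=0)
per_row.columns = [f"p{p}rt" for p in products]

# Broadcast every rating of a user onto all of that user's rows
# (sum is safe only if a user rates a product at most once)
per_user = per_row.groupby(df["User_id"]).transform("sum")
per_user.columns = [f"uspr{p}rt" for p in products]

result = pd.concat([df[["User_id"]], per_row, per_user], axis=1)
print(result)
```

For the real data, a sparse representation or processing one block of users at a time would likely be needed; that is outside what this sketch shows.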

Python, Pandas: shrinking a dataframe

Inside my application I have a dataframe that looks similar to this:
Example:
id | address | code_a | code_b | code_c | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | 1345wqdwqe | ....
2 | parkdrive 1 | 012ah8 | 012ah8a | dwqd4646 | ....
3 | parkdrive 2 | 852fhz | 852fhza | fewf6465 | ....
4 | parkdrive 3 | 456se1 | 456se1a | 856fewf13 | ....
5 | parkdrive 3 | 456se1 | 456se1a | gth8596s | ....
6 | parkdrive 3 | 456se1 | 456se1a | a48qsgg | ....
7 | parkdrive 4 | tg8596 | tg8596a | 134568a | ....
As you can see, every address can have multiple entries inside my dataframe; code_a and code_b follow a certain pattern and only code_c is unique.
What I'm trying to obtain is a dataframe where the column code_c is ignored or dropped and the whole dataframe is reduced to only one entry for each address, something like this:
id | address | code_a | code_b | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | ...
3 | parkdrive 2 | 852fhz | 852fhza | ...
4 | parkdrive 3 | 456se1 | 456se1a | ...
7 | parkdrive 4 | tg8596 | tg8596a | ...
I tried the groupby function, but this doesn't seem to work. Is this even the right function?
Thanks for your help and good day to all of you!
You can use drop_duplicates to do this:
df.drop_duplicates(subset=['address'], inplace=True)
This will keep only a single entry per address
I think what you are looking for is
# in this way you are looking for all the duplicates rows in all columns except for 'code_c'
df.drop_duplicates(subset=df.columns.difference(['code_c']))
# in this way you are looking for all the duplicates rows ONLY based on column 'address'
df.drop_duplicates(subset='address')
I notice that in your example data, if you drop the column code_c, then all the entries with the address "parkdrive 1", for example, are just duplicates.
You should drop the column code_c:
df.drop('code_c',axis=1,inplace=True)
Then you can drop the duplicates:
df_clean = df.drop_duplicates()
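The suggestions above can be combined into one runnable sketch on the sample data. It assumes id is a regular column rather than the index; in that case a plain drop_duplicates() after dropping code_c would keep every row, because the id values differ, so the sketch deduplicates on address explicitly. The "more columns" placeholder from the question is left out.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "address": ["parkdrive 1", "parkdrive 1", "parkdrive 2", "parkdrive 3",
                "parkdrive 3", "parkdrive 3", "parkdrive 4"],
    "code_a": ["012ah8", "012ah8", "852fhz", "456se1", "456se1", "456se1", "tg8596"],
    "code_b": ["012ah8a", "012ah8a", "852fhza", "456se1a", "456se1a", "456se1a", "tg8596a"],
    "code_c": ["1345wqdwqe", "dwqd4646", "fewf6465", "856fewf13",
               "gth8596s", "a48qsgg", "134568a"],
})

# Drop the unique column, then keep the first row seen for each address
result = df.drop(columns="code_c").drop_duplicates(subset="address", keep="first")
print(result)
```

Passing keep="last" instead would keep the last entry per address.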

How to count the number of items in a group after using Groupby in Pandas

I have multiple columns in my dataframe of which I am using 2 columns "customer id" and "trip id". I used the groupby function data.groupby(['customer_id','trip_id']) There are multiple trips taken from each customer. I want to count how many trips each customer took, but when I am using aggregate function along with group by I am getting 1 in all the rows. How should I proceed ?
I want something in this format.
Example :
Customer_id , Trip_Id, Count
CustID1 ,trip1, 3
trip 2
trip 3
CustID2 ,Trip450, 2
Trip23
You can group by customer and count the number of unique trips using the built in nunique:
data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique'))
You can use data.groupby('customer_id')['trip_id'].count()
Example:
df1 = pd.DataFrame(columns=["c1","c1a","c1b"], data = [["x",2,3],["z",5,6],["z",8,9]])
print(df1)
# | c1 | c1a | c1b |
# |----|-----|-----|
# | x | 2 | 3 |
# | z | 5 | 6 |
# | z | 8 | 9 |
df2 = df1.groupby("c1").count()
print(df2)
# | | c1a | c1b |
# |----|-----|-----|
# | x | 1 | 1 |
# | z | 2 | 2 |
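As a sketch of both answers on data shaped like the question's example: the column names Customer_id and Trip_id follow the desired output in the question, and the trip values are filled in from it. Grouping by both columns yields groups of size one, which is why every count came out as 1; grouping by the customer alone gives the number of trips.

```python
import pandas as pd

# Data shaped like the example in the question
data = pd.DataFrame({
    "Customer_id": ["CustID1", "CustID1", "CustID1", "CustID2", "CustID2"],
    "Trip_id": ["trip1", "trip2", "trip3", "Trip450", "Trip23"],
})

# One row per customer with the number of distinct trips
counts = data.groupby("Customer_id").agg(Count=("Trip_id", "nunique"))

# Alternatively, keep every row and attach the per-customer count to it
data["Count"] = data.groupby("Customer_id")["Trip_id"].transform("nunique")

print(counts)
print(data)
```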

pyspark - how can I remove all duplicate rows (ignoring certain columns) and not leaving any dupe pairs behind?

I have the following table:
df = spark.createDataFrame([(2,'john',1,'131434234342'),
(2,'john',1,'10-22-2018'),
(3,'pete',8,'10-22-2018'),
(3,'pete',8,'3258958304'),
(5,'steve',9,'124324234')],
['id','name','value','date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all duplicate pairs (When the duplicates occur in id, name, or value but NOT date) so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could eventually drop the count column if needed
Do a groupBy on the columns you want, count, filter where the count equals 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column in the GroupBy condition if you want
Hope this helps you

Python - Pandas - Converting column with specific subsets into rows

I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where the Serial becomes the column name and the data sorts itself into the correct Date row as well as Serial column.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop but just wondering if there is a more efficient way to do this?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x
You can use pandas.pivot_table:
res = df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc='sum')\
.reset_index()
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840
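For reference, a self-contained sketch on the sample data. Since each (Date, Serial) pair occurs only once here, DataFrame.pivot is enough and no aggregation is needed; pivot_table with an aggregation function is the safer choice if duplicates can occur. The dates are kept as plain strings, as in the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": ["2/1/1996", "2/1/1996", "2/1/1996", "3/1/1996", "3/1/1996", "3/1/1996"],
    "Price": [0.5909, 0.5711, 0.5845, 0.5874, 0.5695, 0.584],
    "Serial": [1, 2, 3, 1, 2, 3],
})

# Each Serial value becomes a column; each Date becomes one row
res = df.pivot(index="Date", columns="Serial", values="Price").reset_index()
res.columns.name = None  # drop the leftover "Serial" label on the column axis

print(res)
```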
