I have a dataframe that is missing certain records, and I want to generate those records and fill their values with 0.
My df looks like this:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
'Year': [2016, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2019],
'Price': [500, 0,450,750,0,0,890,19,120,3],
'Quantity': [1200,0,330,500,190,70,120,300,50,80],
'Value': [600000,0,148500,350000,0,29100,106800,74300,5500,20750]}
df = ps.DataFrame(data)
Some entries in this df are missing, such as the year 2017 for South Africa and the year 2018 for Japan.
I want to generate those entries and fill the columns Price, Quantity and Value with 0.
I managed to do this on a smaller dataset using pandas; however, when I try to implement it using pyspark.pandas, I get an error.
This is the code I have so far:
(df.set_index(['Region', 'Country','Product','Year'])
.reindex(ps.MultiIndex.from_product([df['Region'].unique(),
df['Country'].unique(),
df['Product'].unique(),
df['Year'].unique()],
names=['Region', 'Country','Product','Year']),
fill_value=0)
.reset_index())
Whenever I run it, I get the following issue:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Any ideas why this might happen and how to fix it?
OK, so looking closer at the syntax for ps.MultiIndex.from_product, the iterables have to be passed as plain Python lists rather than pyspark.pandas Series (iterating a distributed Series is what triggers the unimplemented __iter__ error), so I had to add .tolist() after each column's unique values. Code below:
(df.set_index(['Region', 'Country','Product','Year'])
.reindex(ps.MultiIndex.from_product([df['Region'].unique().tolist(),
df['Country'].unique().tolist(),
df['Product'].unique().tolist(),
df['Year'].unique().tolist()],
names=['Region', 'Country','Product','Year']),
fill_value=0)
.reset_index())
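As a quick sanity check (assuming the reindexed result above is assigned to a variable, here called filled, a name that is not in the original snippet), the previously missing combinations should now show up with zeros:
# Hypothetical: 'filled' holds the reindexed result from the snippet above.
check = filled[(filled['Country'] == 'South Africa') & (filled['Year'] == 2017)]
# Because from_product builds the full cross product, every Region/Product pairing
# appears for South Africa in 2017, with Price, Quantity and Value filled with 0.
print(check)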
Here is some sample data from a larger race-result dataset that I have.
import pandas as pd
# initialize data of lists.
data = {'Car': ['A', 'B', 'C', 'D','C', 'A', 'B', 'D','B','C'],
'time': [2012, 2013, 2000, 2012, 2009, 2012, 2013,2000, 2002, 1999],
'pos':[2,1,1,1,3,2,2,2,3,1],
'update':[110738604,110738604,110738604,110738604, 110743097, 110743097, 110743097, 110743097, 110809881, 110509881]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
This dataset gets updated every few seconds (not at a fixed interval, I guess).
I want to find the top car for every update and print a message saying which car is leading; then, in the next update, if the new top car is faster than the previous top one, the announcement should change accordingly.
I did
df_sliced = {}
for name in df['update'].unique():
    df_sliced[name] = df[df['update'] == name]
    df_sliced[name] = df_sliced[name].sort_values('time')
    top = df_sliced[name]['time'].idxmin()
    print('the top car is {} for update {}.'.format(df_sliced[name]['Car'].loc[top], df_sliced[name]['update'].loc[top]))
I tried to first add a new column that displays the word 'top' when the top of the current group is faster than the top of the previous group,
df_sliced[name]['top'] = df_sliced[name]['time'].loc[[top]].values > df_sliced[name]['time'].min()
but I'm stuck and don't know how to do this. Please help.
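Not the original poster's code, but a minimal sketch of one way to do this, assuming the announcement should only change when an update's fastest time beats the best time announced so far:
import pandas as pd

data = {'Car': ['A', 'B', 'C', 'D', 'C', 'A', 'B', 'D', 'B', 'C'],
        'time': [2012, 2013, 2000, 2012, 2009, 2012, 2013, 2000, 2002, 1999],
        'pos': [2, 1, 1, 1, 3, 2, 2, 2, 3, 1],
        'update': [110738604, 110738604, 110738604, 110738604, 110743097,
                   110743097, 110743097, 110743097, 110809881, 110509881]}
df = pd.DataFrame(data)

# Fastest row in each update, with the updates taken in ascending order.
best = df.loc[df.groupby('update', sort=True)['time'].idxmin()]

best_time_so_far = None
for _, row in best.iterrows():
    # Only change the announcement when this update's top car beats the previous best time.
    if best_time_so_far is None or row['time'] < best_time_so_far:
        print('the top car is {} for update {}.'.format(row['Car'], row['update']))
        best_time_so_far = row['time']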
I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
'Year': [2016, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2019],
'Price': [500, 0,450,750,0,0,890,19,120,3],
'Quantity': [1200,0,330,500,190,70,120,300,50,80],
'Value': [600000,0,148500,350000,0,29100,106800,74300,5500,20750]}
df = ps.DataFrame(data)
Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:
WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I know how to work around this with pyspark dataframes, but I'm not sure how to do it with the pandas API on Spark, i.e. how to define a partition for the window operation.
Does anyone have any suggestions?
For Koalas, the repartition seems to only take in a number of partitions here: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
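Not a definitive fix, but two things may be worth trying. The warning often comes from the default sequential index that the pandas API on Spark attaches to a DataFrame, which is computed with a window over a single partition; switching the default index type avoids that window. The number of underlying partitions can also be changed through the spark accessor mentioned in the link above. A minimal sketch (the partition count of 8 is just a placeholder):
import pyspark.pandas as ps

# Use a distributed default index instead of the sequential one; this avoids the
# single-partition window at the cost of non-consecutive index values.
ps.set_option('compute.default_index_type', 'distributed')

df = ps.DataFrame(data)  # 'data' as defined in the question above
df.head()

# Repartition the underlying Spark DataFrame (8 is an arbitrary example value).
df = df.spark.repartition(8)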
I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.
from typing import List, Dict, Any
import pandas as pd
df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
"id": (["A"]*3 + ["B"]*3 + ["C"]*3),
"value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})
def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned
    id = df.iloc[0]["id"]
    count = df.shape[0]
    return pd.DataFrame({"id": [id], "count": [count]})
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
from fugue import transform
# Pandas
pdf = transform(df.copy(),
count,
schema="id:str, count:int",
partition={"by": "id"})
print(pdf.head())
# Spark
transform(sdf,
count,
schema="id:str, count:int",
partition={"by": "id"},
engine=spark).show()
You just need to annotate your function with input and output types and then you can use it with the Fugue transform function. Schema is a requirement for Spark so you need to pass it. If you supply spark as the engine, then the execution will happen on Spark. Otherwise, it will run on Pandas by default.
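One small follow-up: when you pass engine=spark, transform returns a regular Spark DataFrame (which is why .show() works above), so if you want the result back in pandas you can collect it in the usual way, which is fine for small outputs:
result_sdf = transform(sdf,
                       count,
                       schema="id:str, count:int",
                       partition={"by": "id"},
                       engine=spark)
result_pdf = result_sdf.toPandas()  # collects to the driver; only do this for small results
print(result_pdf)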
I'm working on a dataframe where I need to add a new column which is based on other existing rows in the dataframe. Here is a simplified version of what I'm trying to do. Basically there is a dataframe with the purchase and sale record of a particular stock:
import datetime as dt
import pandas as pd

data = {'Name': ['Al', 'John', 'Jack', 'Jack', 'Al', 'John', 'Jack', 'Jack'],
'TradeType' : ['Purch', 'Sold', 'Sold', 'Purch', 'Sold', 'Sold', 'Purch', 'Sold'],
'Date' : [ 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022]}
df = pd.DataFrame(data)
I want to add a new column called 'nTrans2Year' that tells me how many transactions that person has done in the previous two years. The way I am currently doing it is this: I create a temporary dataframe with the right filters and then calculate its length.
df['nTrans2Year'] = 0
df = df.reset_index(drop=True)

for index, row in df.iterrows():
    insName = row['Name']
    date = row['Date']
    tradeType = row['TradeType']
    t_df = df[(df['Name'] == insName) & (df['TradeType'] == tradeType) & (df['Date'] < date) & (df['Date'] > (date - dt.timedelta(days=365 * 2)))]
    df.loc[index, 'nTrans2Year'] = len(t_df)
The only issue with this approach is that it is very computationally intensive, and given that my original dataframe is 300k rows long it's not really an option. Does anybody have a more efficient way of achieving the same result?
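Not claiming this is the only way, but one common trick that avoids the Python-level loop is a self-merge followed by a filtered group count. A minimal sketch on the simplified data (where Date is an integer year, so "two years back" is simply Date - 2; with real timestamps the comparison would use a timedelta instead). Note that the intermediate merge can grow large if one person has many trades, but it is usually far faster than iterrows:
import pandas as pd

data = {'Name': ['Al', 'John', 'Jack', 'Jack', 'Al', 'John', 'Jack', 'Jack'],
        'TradeType': ['Purch', 'Sold', 'Sold', 'Purch', 'Sold', 'Sold', 'Purch', 'Sold'],
        'Date': [2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022]}
df = pd.DataFrame(data)

# Pair every trade with every other trade by the same person and trade type.
pairs = df.reset_index().merge(df, on=['Name', 'TradeType'], suffixes=('', '_other'))

# Keep only the other trades that happened strictly before, within the last two years.
in_window = (pairs['Date_other'] < pairs['Date']) & (pairs['Date_other'] > pairs['Date'] - 2)

# Count the qualifying trades per original row and write the result back.
counts = pairs[in_window].groupby('index').size()
df['nTrans2Year'] = df.index.map(counts).fillna(0).astype(int)
print(df)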
When we save a pandas dataframe as a partitioned parquet, the filenames are generated automatically.
Is it possible to specify the output filenames of each partition?
Using an example
df = pd.DataFrame(data={'year': [2020, 2020, 2021],
'month': [1,12,2],
'day': [1,31,28],
'value': [1000,2000,3000]})
df.to_parquet('./output', partition_cols=['year', 'month'])
output/year=2020/month=1/6f0258e6c48a48dbb56cae0494adf659.parquet
output/year=2020/month=12/cf8a45116d8441668c3a397b816cd5f3.parquet
output/year=2021/month=2/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
Is it possible to get
output/year=2020/month=1/2020_01.parquet
output/year=2020/month=12/2020_12.parquet
output/year=2021/month=2/2021_02.parquet
Thanks for your time
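As far as I know, pandas' to_parquet does not expose a per-partition filename option directly (it delegates the naming to the underlying engine), but one workaround is to write each partition yourself so you control the name. A small sketch of that idea; the output path and naming pattern are just examples:
from pathlib import Path
import pandas as pd

df = pd.DataFrame(data={'year': [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day': [1, 31, 28],
                        'value': [1000, 2000, 3000]})

# Write each (year, month) partition manually so the filename is under our control.
for (year, month), part in df.groupby(['year', 'month']):
    out_dir = Path('output') / f'year={year}' / f'month={month}'
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop the partition columns from the file itself, as to_parquet would do.
    part.drop(columns=['year', 'month']).to_parquet(out_dir / f'{year}_{month:02d}.parquet')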
Apologies, I didn't even know how to title/describe the issue I am having, so bear with me. I have the following code:
import pandas as pd
data = {'Invoice Number':[1279581, 1279581,1229422, 1229422, 1229422],
'Project Key':[263736, 263736, 259661, 259661, 259661],
'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df= pd.DataFrame(data)
How do I get the output to group the Invoice Numbers so that there is only one row per Invoice Number, combining the multiple Project Types for that invoice into a single row?
Code and output for the desired result are below.
Thanks, much appreciated.
import pandas as pd
data = {'Invoice Number':[1279581,1229422],
'Project Key':[263736, 259661],
'Project Type': ['Visibility_Culture', 'Spend_Visibility_Culture']
}
output = pd.DataFrame(data)
output
>>> (df
.groupby(['Invoice Number', 'Project Key'])['Project Type']
.apply(lambda x: '_'.join(x))
.reset_index()
)
   Invoice Number  Project Key              Project Type
0         1229422       259661  Spend_Visibility_Culture
1         1279581       263736        Visibility_Culture
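A small variation on the same idea, in case you prefer it: '_'.join can be passed directly to .agg, and as_index=False saves the reset_index call. This is purely a style choice, not a different result:
output = (df
          .groupby(['Invoice Number', 'Project Key'], as_index=False)['Project Type']
          .agg('_'.join))
print(output)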