How to plot a pivot chart in Python?

I am new to Python scripting. Currently I am using pandas and pivottablejs in a script. I have one CSV file, I read it with pandas, and I got a table like this.
Now I want to generate the pivot chart using pivottablejs, so I have to pass the dataframe object to pivot_ui().
I want to plot in a pivot chart the total number of issues in each Status created for every OriginationPhase.
So I tried something like this.
LabelsReviewedByDate = issues_df.groupby(['Status','OriginationPhase'])
pivot_ui(LabelsReviewedByDate)
I know this is wrong, but I am new to Python scripting, so please help me find the solution.
Thank you.

You can just pass the dataframe right to pivot_ui:
import pandas as pd
from pivottablejs import pivot_ui

a = [[1, 'Requirements', 'bug'], [2, 'Design', 'bug'], [3, 'Testing', 'bug'],
     [4, 'Requirements', 'bug'], [5, 'Requirements', 'Inquiry']]
df = pd.DataFrame(a, columns=['Issue#', 'OriginationPhase', 'Category'])
pivot_ui(df)
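pivot_ui writes an interactive HTML pivot table you can open in a browser. As a hedged sketch (assuming issues_df from the question has Status and OriginationPhase columns, and that the installed pivottablejs version forwards extra keyword arguments to the pivot table configuration, as its README describes), you can also preselect the fields and a bar-chart renderer:
# Sketch only -- rows/cols/rendererName are passed through to the pivottable.js config;
# 'Bar Chart' is one of its standard renderer names.
pivot_ui(issues_df, rows=['Status'], cols=['OriginationPhase'],
         rendererName='Bar Chart', outfile_path='issues_pivot.html')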

The pivot_table method solves this problem. It works like pivot, but it aggregates the values from rows that have duplicate entries for the specified columns:
a = [[1, 'Requirements', 'bug'], [2, 'Design', 'bug'], [3, 'Testing', 'bug'],
     [4, 'Requirements', 'bug'], [5, 'Requirements', 'Inquiry']]
df = pd.DataFrame(a, columns=['Issue#', 'OriginationPhase', 'Category'])
df.pivot_table(index='Category', columns='OriginationPhase', aggfunc=lambda x: len(x))
                 Issue#
OriginationPhase Design Requirements Testing
Category
Inquiry             NaN            1     NaN
bug                   1            2       1
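Applied to the original question's goal (counting issues per Status for each OriginationPhase), a minimal hedged sketch, assuming issues_df has Status and OriginationPhase columns and matplotlib is available for plotting:
counts = pd.crosstab(issues_df['Status'], issues_df['OriginationPhase'])  # Status x phase count table
counts.plot(kind='bar')  # grouped bar chart of the counts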

Related

Setting category and type for multiple columns possible?

I have a dataset which contains 6 columns TIME1 to TIME6, amongst others. For each of these I need to apply the code below (which is shown for 2 columns). LISTED is a prepared list of the possible elements to be seen in these columns.
Is there a way to do this without writing the same 2 lines 6 times?
df['PART1'] = df['TIME1'].astype('category')
df['PART1'].cat.set_categories(LISTED, inplace=True)
df['PART2'] = df['TIME2'].astype('category')
df['PART2'].cat.set_categories(LISTED, inplace=True)
For astype (the first line of code), I tried the following:
for col in ['TIME1', 'TIME2', 'TIME3', 'TIME4', 'TIME5', 'TIME6']:
    df_col = df[col].astype('category')
I think this works (not sure how to check without the whole code working). But how could I do something similar for the second line of code with the set_categories etc?
In short, I'm looking for something shorter/more elegant than just copying and modifying the same 2 lines 6 times.
I am new to Python; any help is greatly appreciated.
Using Python 2.7 and pandas 0.24.2.
Yes, it is possible! We can change the dtype of multiple columns to categorical in one go by creating a CategoricalDtype:
i = pd.RangeIndex(1, 7).astype(str)
df['PART' + i] = df['TIME' + i].astype(pd.CategoricalDtype(LISTED))
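If assigning all the PART columns at once feels opaque, the same idea can be written as an explicit loop, shown here as a sketch under the question's assumptions (LISTED exists and the columns are named TIME1..TIME6; string building kept Python 2.7 compatible):
cat_type = pd.CategoricalDtype(categories=LISTED)  # one categorical dtype shared by all columns
for n in range(1, 7):
    df['PART' + str(n)] = df['TIME' + str(n)].astype(cat_type)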

Remove extra index in a dataframe

I would like to remove the extra index called service_type_id; I have not included it in my code, but it just appears without any reason. I am using Python.
My code is
data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')
The output is this table with extra index:
I believe it has something to do with the groupby and unstack. Kindly highlight to me why there is an extra index and what my code should be.
The dataset
https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing
I hope pandas.DataFrame.droplevel can do the job for your query.
import pandas as pd
df = pd.read_csv('Dataset - Transaction.csv')
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id').droplevel(0,1)
data_tr.head(2)
Output
df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id:
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)
MultiIndex([('service_type_id', '3 Phase Wiring'),
            ('service_type_id', 'AV Equipment'),
            ...
            ('service_type_id', 'Yoga Lessons'),
            ('service_type_id', 'Zumba Classes')],
           names=[None, 'service_type'], length=188)
#print(data_tr.info())
Initially there was only one column (service_type_id) and two index levels, transaction_id and service_type. After you unstack, service_type moves into the columns, which become tuples (a MultiIndex) where each service type holds its value of service_type_id. droplevel(0, 1) will convert your dataframe from a MultiIndex to a single Index as follows:
print(data_tr.columns)
Index(['3 Phase Wiring', ......,'Zumba Classes'],
dtype='object', name='service_type', length=188)
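For readability, the same call can also be written with explicit keyword arguments; level 0 of the column axis is the one being dropped:
data_tr = data_tr.droplevel(level=0, axis=1)  # drop the 'service_type_id' level from the columns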
It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that your sum generates a sum for every (numerical) column.
For insight, try to execute just
data.groupby(['transaction_id', 'service_type']).sum()
Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:
data_tr = data[['transaction_id', 'service_type', 'service_type_id']].pivot(
    index='transaction_id',
    columns='service_type',
    values='service_type_id'
).fillna(0)
Depending on how you like your data structure, you can follow up with a .reset_index()

for loop list to dataframe

I have the following for loop for a dataframe
# this is my data
import yfinance as yf  # assumed import; yf.download comes from the yfinance package
df = yf.download('AAPL', period='max', interval='1d')

vwap15 = []
for i in range(0, len(df)-1):
    if i >= 15:
        vwap15.append(sum(df["Close"][i-15:i] * df["Volume"][i-15:i]) / sum(df["Volume"][i-15:i]))
    else:
        vwap15.append(None)
When I created the above for loop it generated a list.
I actually want to have it as a dataframe that I can join to my original dataframe df.
Any insights would be appreciated.
Thanks.
Maybe you mean something like (right after the loop):
df["vwap15"] = vwap15
Note that you will need to fix your for loop like so (otherwise lengths will not match):
for i in range(len(df)):
Maybe you want to have a look at currently available packages for Technical Analysis indicators in Python with Pandas.
Also, try to use NaN instead of None and consider using the Pandas .rolling method when computing indicators over a time window.
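For example, here is a minimal sketch of a 15-bar rolling version of the computation above (note that a rolling window includes the current row, while the original slice [i-15:i] stops just before it):
pv = (df["Close"] * df["Volume"]).rolling(15).sum()  # rolling sum of price * volume
vol = df["Volume"].rolling(15).sum()                 # rolling sum of volume
df["vwap15"] = pv / vol                              # NaN for the first rows instead of None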

Method to dynamically build large dataframe (spark or pandas) for export to csv

I have a csv that I import into databricks using spark.read. This large file contains records/transactions on a daily level. I trim the dataframe down to 5 columns and leave the 500,000 rows as-is. I am trying to build a summary table of this source file that represents these records/transactions at a month level (aggregate).
The script has a filter/groupby/sum command that returns one row that summarizes the data into counts for a month. A row that is returned by the query would look like this:
+---------+---------+-------+-------------+
| Country|StockCode|YYYY-MM|sum(Quantity)|
+---------+---------+-------+-------------+
|Singapore| M| 2011-4| 10|
+---------+---------+-------+-------------+
The script iterates over the source dataframe and returns a row each time. I am having trouble using the output (display or CSV export) of this script; both in pyspark and in pandas I have problems. I'm not sure how to stack the results of the query or what form they should be in.
Pandas:
If I do it in pandas, the script takes very long to generate the file, around 2.5 hours (I believe pandas plus my not doing it very efficiently causes the extended duration). The display and write.csv commands work rather quickly though and complete in approximately a few seconds.
Pyspark:
If I do this in pyspark the script takes about 10 minutes to complete, but the display and the export crash. The notebook either returns a timeout error, restarts, or throws crash errors.
Should the approach be to create a list of lists dynamically, and when that is completely built, convert that to a dataframe for use? I've been trying all the ways I have come across and I seem to not make any progress.
Here is the code that generates the results
#officeSummaryDFBefore
column_names = "Country|StockCode|YYYY-MM|Quantity"
monthlyCountsBeforeImpactDate = spark.createDataFrame(
    [tuple('' for i in column_names.split("|"))],
    column_names.split("|")
).where("1=0")
monthlyCountsBeforeImpacteDateRow = spark.createDataFrame(
    [tuple('' for i in column_names.split("|"))],
    column_names.split("|")
).where("1=0")
try:
    for country in country_lookup:
        country = country[0]
        print(country_count, " country(s) left")
        country_count = country_count - 1
        for stockCode in stockCode_lookup:
            stockCode = stockCode[0]
            monthlyCountsBeforeImpacteDateRow = dataBeforeImpactDate.filter(
                (col("Country").rlike(country)) & (col("StockCode").rlike(stockCode))
            ).groupby("Country", "StockCode", "YYYY-MM").sum()
            monthlyCountsBeforeImpacteDateRow.show()
            dfsCountsBefore = [monthlyCountsBeforeImpacteDateRow, monthlyCountsBeforeImpactDate]
            monthlyCountsBeforeImpactDate = reduce(DataFrame.union, dfsCountsBefore)
except Exception as e:
    print(e)
I declare dfsCountsBeforeImpactDate inside the loop which doesn't seem right, but when it is outside it comes back as NULL.
IIUC
You are doing a lookup on country and stock to restrict the rows then grouping over them to generate the aggregations.
Why not filter the df all at once and then group?
df = dataBeforeImpactDate
df = df.filter(col('country').isin(country_lookup) & col('stock').isin(stock_lookup))
df = df.groupby("Country", "StockCode", "YYYY-MM").sum()
df.show()
This will be way faster, as you are not looping around to filter, and there is also no need for a union.
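One hedged caveat: judging by the original loop's country[0] indexing, country_lookup and stockCode_lookup look like lists of Row objects, so they would need to be flattened to plain values before being passed to isin. A sketch under that assumption:
country_values = [row[0] for row in country_lookup]    # assumed: lookups are lists of Rows
stock_values = [row[0] for row in stockCode_lookup]
df = df.filter(col('Country').isin(country_values) & col('StockCode').isin(stock_values))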

Python - Linking columns in Excel for sorting

The problem that I have to solve:
I'm trying to automate several processes in Excel, and I'm currently stuck on the first one. (Also, I'm pretty weak at using Excel, so I apologize in advance if some of the things I say don't make sense.) I scraped data from the internet and put it into an Excel file. I concatenated that data with a spreadsheet I already had. Here's the code I used to combine the files.
import numpy as np
import pandas as pd

def MergeFiles():
    # find both csv files on computer
    baseData = pd.read_csv('pathname')     # keep this on the left
    scrapedData = pd.read_csv('pathname')  # keep this on the right
    mergedFile = pd.concat([baseData, scrapedData], axis=1)
    mergedFile.to_csv('pathname', index=False)

MergeFiles()
What I want to do:
Col1 Col2
c    1
b    2
a    3
Alphabetically order Col1 so that the values in Col2 shift along with it:
Col1 Col2
a    3
b    2
c    1
I'm trying to link the columns together so that when I sort, all rows go through the same position shift.
I tried looking into the pandas documentation and couldn't find anything related to this problem. I probably missed something, so any help would be appreciated!
So apparently the pandas library does all of this automatically through sort_values(). So
scrapedData = scrapedData.sort_values(by=['colName'], ascending=True)  # sort the scrapedData
scrapedData.to_csv('pathName', index=False)  # replace the file
would do the trick
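Applied to the merged file from the MergeFiles example above, a hedged sketch (the Col1/Col2 names and 'pathname' placeholder are taken from the question): sort_values reorders whole rows, so Col2 stays aligned with Col1.
mergedFile = mergedFile.sort_values(by='Col1', ascending=True)  # whole rows move together
mergedFile.to_csv('pathname', index=False)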
