my db has the following format:
[screenshot: database]
I'm trying to get a count of each region grouped by month and i've tried different methods but still can't figure it out. I keep getting a total count for each month but can't get it to break down by region by month
the last method I tried was set_index:
import pandas as pd
import numpy as np
import datetime
%matplotlib inline
ax = report[['date','region']].copy()
ax['date'] = pd.to_datetime(ax['date'])
ax['month'] = pd.DatetimeIndex(ax['date']).month
ax = ax.set_index('date').groupby(pd.Grouper(freq='M')).count()
ax
but this one returns a total count of occurrences per month, without the breakdown by region.
[screenshot: output]
any ideas how to solve it?
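One way that should give the per-region breakdown (a sketch, assuming report has the date and region columns shown in the screenshot): group by both a monthly Grouper on the index and the region column, then unstack so each region becomes its own column.
import pandas as pd

ax = report[['date', 'region']].copy()  # assumes `report` from the question
ax['date'] = pd.to_datetime(ax['date'])

# Group by month AND region; .size() counts rows per (month, region) pair,
# and .unstack() turns each region into its own column.
counts = (ax.set_index('date')
            .groupby([pd.Grouper(freq='M'), 'region'])
            .size()
            .unstack('region', fill_value=0))
counts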
I am extracting the number of orders for each category according to its date, i.e. counting how many times each product type appeared, using Python. Keep in mind that the date was not originally a column in the data; I transformed the date-and-time variable into a date variable.
I used the following libraries:
#Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
These are my products:
print(" The Total number for each product type: ")
data.products.value_counts()
the output is
productA 8124
productB 3047
The transformation:
data['date'] = pd.to_datetime(data['Assessment_Date']).dt.date
data['date']=pd.to_datetime(data['date'])
Here is where I tried to count how many orders there were for each product based on the date:
data.set_index('date',inplace=True)
and the output is
KEYERROR: 'date' is not a column in the data
Please, if you have another way to get the output below, let me know.
PRODUCT TYPE   DATE        NUMBER OF ORDERS
productA       1-11-2020   11
productB       1-11-2020   122
and so on for the whole period
thank you
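A sketch of one way to get that table (assuming data still has date and products as columns, as above): group by both date and product type and take the group sizes. The KeyError suggests set_index('date', inplace=True) had already been run once, which moves date from the columns into the index, so running it again fails; resetting the index first avoids that.
# Bring 'date' back into the columns if it was already made the index:
data = data.reset_index()

# One row per (date, product) pair with the number of orders:
orders = (data.groupby(['date', 'products'])
              .size()
              .reset_index(name='number_of_orders'))
print(orders)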
I have a datetime-indexed dataframe with one entry per hour of the year (the format is "2019-01-01 00:00:00", for example).
I created a program that plots every week, but some of the plots I obtain look weird.
I was thinking that it may be a continuity problem in my dataframe, some data that wouldn't be indexed in the right place, but I don't know how to check this.
If someone has a clue, it would help me a lot!
Have a nice day all
Edit: I'll try to provide some code.
First of all, I can't provide the exact data I'm using since it's professional, but I'll try to adapt my code to a randomly generated dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
mpl.rc('figure', max_open_warning = 0)
df = pd.DataFrame({'datetime': pd.date_range('2019-01-01', '2020-12-31',freq='1H', closed='left')})
df['datetime'] = pd.to_datetime(df['datetime'])
# The real data columns are confidential, so fill data1..data4 with random
# values just so the example runs end to end.
for c in ['data1', 'data2', 'data3', 'data4']:
    df[c] = np.random.rand(len(df))
df['week'] = df['datetime'].dt.isocalendar().week
df['month'] = df['datetime'].dt.month
df=df.set_index(['datetime'])
df=df[['data1','data2','data3','data4','week','month']]
df19=df.loc['2019-01':'2019-12']
df20=df.loc['2020-01':'2020-12']
if not os.path.exists('mypath/Programmes/Semaines/2019'):
    os.mkdir('mypath/Programmes/Semaines/2019')
def graph(a): # Generate all the 2019 plots for column a and save them in the right folder, skipping the 1st week of the year because it's bugged
    for i in range(2, 53):
        if not os.path.exists('mypath/Programmes/Semaines/2019/'+str(a)):
            os.mkdir('mypath/Programmes/Semaines/2019/'+str(a))
        folder = 'mypath/Programmes/Semaines/2019/'+str(a)
        plt.figure(figsize=[20, 20])
        x = df19[[a]][df19["week"] == i]
        plt.plot(x)
        name = str(a)+"_"+str(i)
        plt.savefig(os.path.join(folder, name))
        plt.close()  # close each figure so dozens don't stay open in memory
    return
for j in df19.columns.drop(['week', 'month']):  # plot only the data columns
    graph(j)
Hoping this can help even if I'm not providing the data directly :/
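As for checking the continuity, a minimal sketch (assuming the frame is indexed by datetime as above) that looks for missing hours, duplicated timestamps, and out-of-order timestamps, any of which would scramble the weekly plots:
# Hours that should exist but are absent from the index:
expected = pd.date_range(df.index.min(), df.index.max(), freq='1H')
print(expected.difference(df.index))

# Duplicated or out-of-order timestamps:
print(df.index.duplicated().any())
print(df.index.is_monotonic_increasing)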
I'm trying to create a stacked bar-graph which shows two transaction types for a customer. The graph is sorted into columns by week.
Sample code within my code structure is below:
%matplotlib inline
import pandas as pd
values = [('1','2019-07-28','retail',11),
('1','2019-07-28','wholesale',18),
('1','2019-08-04','retail',7),
('1','2019-08-04','wholesale',12),
('1','2019-08-11','retail',6),
('1','2019-08-11','wholesale',16)]
columns = ['customer_id','week',
'transaction_type',
'sale_count']
df = pd.DataFrame(values, columns=columns)
df.groupby(['week','transaction_type']).size()\
.unstack()\
.plot(sort_columns='week',
kind='bar', stacked=True);
The result I'm getting is a row count for each transaction_type as either 1 or 2
current: [screenshot]
What I need is a stacked bar graph that gives the sum of sale_count for each date listed in week like the one below
expected: [screenshot]
Can anyone tell me what I'm doing wrong here?
Similar to what was suggested in the comments:
(df.groupby(['week','transaction_type'])['sale_count']
.sum().unstack('transaction_type')
.plot.bar(stacked=True)
)
Output: [screenshot of the stacked bar chart]
#Quang Hoang's answer is correct and should be accepted and upvoted. This is just a note about formatting the code: I think it is better to get rid of the extra round brackets and move the legend outside the plot, as in the following code.
df.groupby(['week','transaction_type'])['sale_count']\
.sum().unstack('transaction_type')\
.plot.bar(stacked=True, rot=0)\
.legend(bbox_to_anchor=(1.3, 1.0));
I have a json file which looks like the one shown in the picture.
How can I import and print all the Quantity and Rate in Pandas?
How can I print the sum of all the Quantity for Buy and Sell separately?
How can I print the sum of all the Quantity values greater than x? E.g. SUM(Qty > 5).
In raw format, the data is like this
{"success":true,"message":"","result":{"buy":[{"Quantity":199538.30948659,"Rate":0.00000970},{"Quantity":62142.31715449,"Rate":0.00000968},{"Quantity":233476.03486058,"Rate":0.00000967},{"Quantity":75613.30879931,"Rate":0.00000966},{"Quantity":3109.14961399,"Rate":0.00000965},{"Quantity":66.22406639,"Rate":0.00000964},{"Quantity":401.06420081,"Rate":0.00000963},{"Quantity":186.93339628,"Rate":0.00000961},{"Quantity":122731.01165366,"Rate":0.00000960},{"Quantity":7718.27750144,"Rate":0.00000959},{"Quantity":802.00000000,"Rate":0.00000958},{"Quantity":2050.72163419,"Rate":0.00000956},{"Quantity":1000.00000000,"Rate":0.00000955}
import pandas as pd
#change 'buy' for other results
data = pd.DataFrame(pd.read_json('file.json')['result']['buy'])
#for filtering
print(data.query('Quantity > 5').query('Rate > 0.00000966').sum())
You can use the pandas.read_json() command to do this. Just pass it your json file in the function and pandas will create a dataframe out of it for you.
Here's the link to the documentation for it, where you can pass extra parameters like orient='records' and so on to tell pandas what to use as the dataframe columns and what to use as row data, etc.
Here's the link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Once in the dataframe, you can run various commands to calculate the sums of Quantity for buy and sell. Having the data in a dataframe makes life a bit easier when running math calculations, in my opinion.
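For instance, a minimal sketch along those lines (assuming the file is named file.json; the sell list is an assumption, since the raw snippet above only shows buy):
import pandas as pd

raw = pd.read_json('file.json')             # columns: success, message, result
buy = pd.DataFrame(raw['result']['buy'])    # the 'buy' list of dicts becomes rows
sell = pd.DataFrame(raw['result']['sell'])  # assumes a matching 'sell' list exists

print(buy['Quantity'].sum(), sell['Quantity'].sum())   # totals, separately
print(buy.loc[buy['Quantity'] > 5, 'Quantity'].sum())  # SUM(Qty > 5)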
Use json_normalize and pass the record path:
import json
import pandas as pd
with open('data.json') as f:
data = json.load(f)
buy_df = pd.io.json.json_normalize(data['result'],'buy')
#Similarly for sell data if you have a separate entity named `sell`.
sell_df = pd.io.json.json_normalize(data['result'],'sell')
Output:
Quantity Rate
0 199538.309487 0.00001
1 62142.317154 0.00001
2 233476.034861 0.00001
3 75613.308799 0.00001
4 3109.149614 0.00001
For sum you can do
buy_df['Quantity'].sum()
From here on, for selecting and indexing the data, refer to Indexing and Selecting Data - Pandas.
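A side note: in recent pandas versions json_normalize is exposed at the top level as pd.json_normalize, and the conditional sum from the question works the same way on the normalized frame:
buy_df = pd.json_normalize(data['result'], 'buy')  # top-level alias since pandas 1.0

# SUM(Qty > 5), as asked in the question:
print(buy_df.loc[buy_df['Quantity'] > 5, 'Quantity'].sum())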
Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try the Pandas shift method, it fails when there are gaps; try it yourself using the csv files and the code below. All three derived columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data: [screenshot]
Full data: [screenshot]
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a dataframe with the keys (Firm and date) and the relevant variable (here MV):
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
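The same merge trick covers the other lags, e.g. the quarterly change in MV: shift the date key back one quarter instead of four, merge, and subtract (a sketch; Lag1_MV is just an illustrative column name):
dfg['lag1Q'] = dfg['date'] + QuarterEnd(-1)
lag1 = dfg[['Firm', 'date', 'MV']].rename(columns={'MV': 'Lag1_MV', 'date': 'lag1Q'})
dfg = pd.merge(dfg, lag1, on=['Firm', 'lag1Q'], how='left')
dfg['D_MV_Q'] = dfg['MV'] - dfg['Lag1_MV']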