create a bigram from a column in pandas df - python

I have this test table in a pandas DataFrame:
Leaf_category_id session_id product_id
0 111 1 987
3 111 4 987
4 111 1 741
1 222 2 654
2 333 3 321
This is an extension of my previous question, which was answered by #jazrael.
After getting the values in the product_id column as follows (just an assumption, slightly different from the output of my previous question):
|product_id |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |
and so on.
I want to create a new column in which each value in a row is paired with the next one as a bigram, and the last number in the row is paired with the first one. For example:
|product_id |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)
Ignore the ** (I'll explain below why I starred those).
The code to achieve the bigram is:
for i in df.Leaf_category_id.unique():
    print(df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())
From this df, I want to take the Bigram column and add one more column named frequency, which gives the number of times each bigram occurred.
Note: (987,741) and (741,987) are to be considered the same, so the duplicate entry should be removed and the frequency of (987,741) should be 2.
The same applies to (34,12): it occurs twice, so its frequency should be 2.
|Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)
Final Result should be.
|Bigram | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1
I am hoping to find an answer here. Please help; I have elaborated as much as possible.

try this code
from itertools import combinations
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the replacement
df = pd.read_csv("data.csv", index_col=0)
#consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index()
df1=pd.DataFrame(grouped_consecutive_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]
for combinations (all possible bi-grams)
from itertools import combinations
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the replacement
df = pd.read_csv("data.csv", index_col=0)
#combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index()
df1=pd.DataFrame(grouped_combination_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]
where data.csv contains
Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12
The resultant bigram_frequency_consecutive will be
Bigram freq
0 (12, 34) 2
1 (12, 324) 1
2 (12, 654) 1
3 (12, 987) 1
4 (32, 342) 1
5 (34, 87) 1
6 (34, 741) 1
7 (87, 741) 1
8 (111, 741) 1
9 (123, 987) 1
10 (321, 321) 1
11 (321, 741) 1
12 (324, 465) 1
13 (324, 654) 1
14 (324, 987) 1
15 (342, 465) 1
16 (654, 862) 1
17 (741, 987) 2
18 (987, 1232) 1
The resultant bigram_frequency_combinations will be
Bigram freq
0 (12, 32) 1
1 (12, 34) 2
2 (12, 87) 1
3 (12, 111) 1
4 (12, 123) 1
5 (12, 324) 1
6 (12, 342) 1
7 (12, 465) 1
8 (12, 654) 1
9 (12, 741) 2
10 (12, 987) 2
11 (32, 324) 1
12 (32, 342) 1
13 (32, 465) 1
14 (32, 654) 1
15 (34, 87) 1
16 (34, 111) 1
17 (34, 123) 1
18 (34, 741) 2
19 (34, 987) 2
20 (87, 111) 1
21 (87, 741) 1
22 (87, 987) 1
23 (111, 741) 1
24 (111, 987) 1
25 (123, 741) 1
26 (123, 987) 1
27 (321, 321) 1
28 (321, 324) 2
29 (321, 654) 2
30 (321, 741) 2
31 (321, 862) 2
32 (321, 987) 2
33 (324, 342) 1
34 (324, 465) 1
35 (324, 654) 2
36 (324, 741) 1
37 (324, 862) 1
38 (324, 987) 1
39 (342, 465) 1
40 (342, 654) 1
41 (465, 654) 1
42 (654, 741) 1
43 (654, 862) 1
44 (654, 987) 1
45 (741, 862) 1
46 (741, 987) 3
47 (862, 987) 1
48 (987, 1232) 1
In both cases above, the grouping is by Leaf_category_id and session_id together.

We are going to pull out the values from product_id, create bigrams that are sorted and thus deduplicated, and count them to get the frequency, and then populate a data frame.
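from collections import Counter
import pandas as pd
# assuming your data frame is called 'df' and each product_id value is a list of ids
bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
# sort each pair so that (987, 741) and (741, 987) count as the same bigram
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_set)
# one row per distinct bigram, with its count
df_freq = pd.DataFrame({'bigram': list(freq_dict.keys()), 'freq': list(freq_dict.values())})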
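The question also asks for the last id in each row to be paired with the first one, which the answers above do not do. A minimal sketch of that wrap-around variant follows; the helper name circular_bigrams and the two sample rows (taken from the question's table) are only for illustration, and it assumes each row is already a list of ids.

```python
from collections import Counter

import pandas as pd


def circular_bigrams(items):
    """Pair each item with the next one; the last item wraps around to the first."""
    items = list(items)
    if len(items) < 2:
        return []
    # items[1:] + items[:1] is the same list rotated left by one position
    return [tuple(sorted(pair)) for pair in zip(items, items[1:] + items[:1])]


# two rows from the question's table (illustrative input)
rows = [[111, 987, 741, 34, 12], [321, 741, 987]]

# sorting each pair makes (987, 741) and (741, 987) the same key
counts = Counter(bg for row in rows for bg in circular_bigrams(row))

freq = pd.DataFrame({"Bigram": list(counts.keys()),
                     "frequency": list(counts.values())})
print(freq)
```

Note that a row with only two ids yields the same sorted pair twice under this rule, so decide whether that double count is wanted.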


Pandas groupby using range and type

I have a Dataframe where I have "room_type" and "review_scores_rating" as labels
The dataframe looks like this
room_type review_scores_rating
0 Private room 98.0
1 Private room 89.0
2 Entire home/apt 100.0
3 Private room 99.0
4 Private room 97.0
I already use groupby so I also have this dataframe
review_scores_rating
room_type
Entire home/apt 11930
Hotel room 97
Private room 3116
Shared room 44
I want to create a dataframe where I have as columns the different room types and each row counts how many are in for different ranges of the rating
I was able to get to this point
count
review_scores_rating
(19.92, 30.0] 24
(30.0, 40.0] 23
(40.0, 50.0] 9
(50.0, 60.0] 97
(60.0, 70.0] 74
(70.0, 80.0] 486
(80.0, 90.0] 1701
(90.0, 100.0] 12773
But I don't know how to make it count not only by range of the score but also by room type, so that I can know, for example, how many private rooms have a review score rating between 30 and 40.
You can use a crosstab with cut:
pd.crosstab(pd.cut(df['review_scores_rating'], bins=range(0, 101, 10)),
            df['room_type'])
Output:
room_type Entire home/apt Private room
review_scores_rating
(80, 90] 0 1
(90, 100] 1 3
Or groupby.count:
df.groupby(['room_type', pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))]).count()
Output:
review_scores_rating
room_type review_scores_rating
Entire home/apt (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 0
(90, 100] 1
Private room (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 1
(90, 100] 3
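For reference, here is a self-contained sketch of the crosstab approach, built from the five sample rows shown in the question. The variable names rating_bins and table are illustrative only.

```python
import pandas as pd

# the five sample rows from the question
df = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "review_scores_rating": [98.0, 89.0, 100.0, 99.0, 97.0],
})

# bucket the ratings into (0, 10], (10, 20], ..., (90, 100]
rating_bins = pd.cut(df["review_scores_rating"], bins=range(0, 101, 10))

# rows are rating ranges, columns are room types, cells are counts
table = pd.crosstab(rating_bins, df["room_type"])
print(table)
```

Each column of table then holds the per-range counts for one room type, which is the layout the question asks for.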

How to add column for every month and generate number i.e. 1,2,3..etc

I have a huge csv file of dataframe. However, I don't have the date column. I only have the sales for every month from Jan-2022 until Dec-2034. Below is the example of my dataframe:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have more than 10 monthly sales column, I want to loop the date after each of the month sales column. Then, the first 6 months will generate number 1, while the next 12 months will generate number 2, then another 12 months will generate number 3, another subsequent 12 months will generate number 4 and so on.
Below shows the sample of result that I want:
Is there any way to perform the loop and adding the date column beside each of the sales month?
Here is the simplest approach I can think of:
# str.removeprefix needs Python 3.9+
for i, col in enumerate(ds.columns[2:]):
    ds.insert(2 * i + 2, col.removeprefix("Sales"), (i - 6) // 12 + 2)
Here is a vectorial approach (using insert repeatedly is inefficient):
import numpy as np
# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get year
y = cols[m].year
# calculate number (1 for first 6 months, then +1 per 12 months)
num = ((cols[m].month+12*(y-y.min()))+5)//12+1
# slice dates columns, assign the number, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
       )
# get new order of columns
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2)+1]
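# concat and reorder (np.argsort is not stable by default, so a Sales column and
# its new column may swap places; pass kind='stable' to keep the Sales column first)
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx)]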
print(out)
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 SalesMay-22 May-22 SalesJun-22 Jun-22 SalesJul-22 Jul-22 SalesAug-22 Aug-22 Sep-22 SalesSep-22 Oct-22 SalesOct-22 SalesNov-22 Nov-22 Dec-22 SalesDec-22 Jan-23 SalesJan-23 Feb-23 SalesFeb-23 SalesMar-23 Mar-23 Apr-23 SalesApr-23 SalesMay-23 May-23 SalesJun-23 Jun-23 Jul-23 SalesJul-23 SalesAug-23 Aug-23 Sep-23 SalesSep-23 SalesOct-23 Oct-23 Nov-23 SalesNov-23 Dec-23 SalesDec-23 Jan-24 SalesJan-24 Feb-24 SalesFeb-24 Mar-24 SalesMar-24 Apr-24 SalesApr-24 May-24 SalesMay-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 1 5000 1 78564 1 52353 1 67456 1 86965 1 43634 2 32546 2 2 56332 2 58944 98878 2 2 68588 2 43634 2 3463 74533 2 2 73733 64436 2 45426 2 3 57333 89762 3 3 4373 75457 3 3 74845 3 86843 3 59957 3 74563 3 745335 3 46342 3 463473 52352 3 23622 4
1 6672 Play Station 4475 1 2546 1 5757 1 2352 1 57896 1 98574 1 53536 2 56533 2 2 88645 2 44884 76585 2 2 43575 2 74573 2 75347 57573 2 2 5736 53737 2 35235 2 3 5322 54757 3 3 74573 75473 3 3 77362 3 21554 3 73462 3 74736 3 1435 3 4367 3 63462 32362 3 56332 4
2 6631 Laptop 35347 1 36376 1 164577 1 94584 1 78675 1 76758 1 75464 2 56373 2 2 56343 2 54787 7658 2 2 76584 2 47347 2 5748 8684 2 2 75373 57573 2 26626 2 3 25632 73774 3 3 847373 736646 3 3 847457 3 57346 3 43732 3 347346 3 75373 3 6473 3 85674 35743 3 45734 4
3 6600 Camera 14365 1 60785 1 25436 1 46747 1 75456 1 97644 1 63573 2 56433 2 2 25646 2 32548 14325 2 2 64748 2 68458 2 46537 7537 2 2 46266 7457 2 78235 2 3 46223 8747 3 3 67453 4636 3 3 3425 3 4636 3 352236 3 6622 3 64625 3 36346 3 46346 35225 3 6436 4
4 6643 Lamp 324355 1 143255 1 696954 1 97823 1 43657 1 66686 1 56346 2 57563 2 2 65734 2 64484 87685 2 2 54748 2 9868 2 573 73472 2 2 5735 73422 2 86352 2 3 5325 84333 3 3 7473 35252 3 3 7547 3 73733 3 7374 3 32266 3 654747 3 85743 3 57333 46346 3 46266 4
Here's a small solution (I put the year instead of your 1, 2, ... numbering since I thought it was more representative, but you can change it easily):
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        # each earlier insert shifts later columns right by one, hence idx_counter
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
This should do the trick.
import math
new_cols = []
# ds is the DataFrame from the question
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])
    if i < 6:
        val = 1
    else:
        val = ((i+6)/12)+1
    ds[col[5:]] = math.floor(val)
# interleave each Sales column with its new number column
ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]
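The numbering rule used by the answers above (1 for the first 6 months, then one more for every following block of 12 months) can be checked on its own, without a DataFrame. This is a small sketch; the function name period_number is hypothetical, and i is the zero-based position of the month column.

```python
def period_number(i: int) -> int:
    """Return 1 for month positions 0-5, 2 for 6-17, 3 for 18-29, and so on."""
    # shifting by 6 aligns the first half-year with the end of a 12-month block
    return (i + 6) // 12 + 1


# 31 monthly columns, as in the sample data (Jan-22 to Jul-24)
numbers = [period_number(i) for i in range(31)]
print(numbers)
```

This is the same arithmetic as (i - 6) // 12 + 2 in the first answer, written with a non-negative numerator.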

subtract two columns in a data frame if they have the same ending in a loop

If my data looks like this
Index Country ted_Val1 sam_Val1 ... ted_Val10 sam_Val10
1 Australia 1 3 ... 20 5
2 Bambua 12 33 ... 15 56
3 Tambua 14 34 ... 10 58
df = pd.DataFrame([["Australia", 1, 3, 20, 5],
["Bambua", 12, 33, 15, 56],
["Tambua", 14, 34, 10, 58]
], columns=["Country", "ted_Val1", "sam_Val1", "ted_Val10", "sam_Val10"]
)
I'd like to subtract each 'sam_' column from the matching 'ted_' column using a list, creating a new column starting with 'dif_' such that:
Index Country ted_Val1 sam_Val1 diff_Val1 ... ted_Val10 sam_Val10 diff_Val10
1 Australia 1 3 -2 ... 20 5 15
2 Bambua 12 33 -21 ... 15 56 -41
3 Tambua 14 34 -20 ... 10 58 -48
so far I've got:
calc_vars = ['ted_Val1',
'sam_Val1',
'ted_Val10',
'sam_Val10']
for i in calc_vars:
df_diff['dif_' + str(i)] = df.['ted_' + str(i)] - df.['sam_' + str(i)]
but I'm getting errors, not sure where to go from here. As a warning this is dummy data and there can be several underscores in the names
IIUC you can use filter to choose the columns for subtraction (assuming your columns are properly sorted like your sample):
print (pd.concat([df, pd.DataFrame(df.filter(like="ted").to_numpy()-df.filter(like="sam").to_numpy(),
                                   columns=["diff"+i.split("_")[-1] for i in df.columns if "ted_Val" in i])], axis=1))
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diffVal1 diffVal10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
try this,
calc_vars = ['ted_Val1', 'sam_Val1', 'ted_Val10', 'sam_Val10']
# extract even & odd values from calc_vars
# ['ted_Val1', 'ted_Val10'], ['sam_Val1', 'sam_Val10']
for ted, sam in zip(calc_vars[::2], calc_vars[1::2]):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Edit: if columns are not sorted,
# raw strings avoid the invalid-escape warning for \d
ted_cols = sorted(df.filter(regex=r"ted_Val\d+"), key=lambda x : x.split("_")[-1])
sam_cols = sorted(df.filter(regex=r"sam_Val\d+"), key=lambda x : x.split("_")[-1])
for ted, sam in zip(ted_cols, sam_cols):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff_Val1 diff_Val10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
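Since the question warns that real column names may contain several underscores, here is a sketch that pairs columns by stripping the known prefix instead of splitting on "_". It assumes every column to compare starts with exactly "ted_" or "sam_"; those prefixes and the "diff_" output name come from the question, while the loop variables are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["Australia", 1, 3, 20, 5],
     ["Bambua", 12, 33, 15, 56],
     ["Tambua", 14, 34, 10, 58]],
    columns=["Country", "ted_Val1", "sam_Val1", "ted_Val10", "sam_Val10"],
)

# snapshot the ted_ columns first, because the loop adds new columns to df
ted_cols = [c for c in df.columns if c.startswith("ted_")]

for ted_col in ted_cols:
    suffix = ted_col[len("ted_"):]   # everything after the prefix, underscores included
    sam_col = "sam_" + suffix
    if sam_col in df.columns:        # skip ted_ columns with no sam_ partner
        df["diff_" + suffix] = df[ted_col] - df[sam_col]

print(df)
```

Matching by name this way does not depend on the columns being sorted or adjacent.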

How do I join two columns into another separate column in Pandas?

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to combine two columns, Latitude and Longitude, and put the result into a column called Location.
For example:
The first row in Latitude has a value of 41.864073 and the first row of Longitude has a value of -87.706819.
I would like the 'Location' column to display 41.864073, -87.706819.
Please and thank you.
Setup
df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be better than using apply
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
list variant
Though I'd still suggest tuple
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]
I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3,4]], columns=['lon', 'lat'])
>>> df
>>>
lon lat
0 1 2
1 3 4
>>>
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
>>>
lon lat Location
0 1 2 (1, 2)
1 3 4 (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)
Data from Pir
df['New']=tuple(zip(*df[['lat','lon']].values.T))
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd
from pandas import *
Dic = {'Lattitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Lattitude'].astype(str) + ',' + DF['Longitude'].astype(str)

Create plot with Pandas and show similar output as with Matplotlib directly

I have a query that I run that outputs a list of data consisting of a date string and a count:
date_cnts = [(u'2014-06-27', 1),
(u'2014-06-29', 3),
(u'2014-06-30', 1),
(u'2014-07-01', 1),
(u'2014-07-02', 1),
(u'2014-07-09', 1),
(u'2014-07-10', 3),
(u'2014-07-11', 1),
(u'2014-07-12', 2),
(u'2014-07-14', 1),
(u'2014-07-15', 2),
(u'2014-07-17', 3),
(u'2014-07-18', 1),
(u'2014-07-20', 1),
(u'2014-07-21', 1),
(u'2014-07-23', 2),
(u'2014-07-26', 2),
(u'2014-07-27', 2),
(u'2014-07-28', 7),
(u'2014-07-29', 3),
(u'2014-07-31', 2),
(u'2014-08-01', 1),
(u'2014-08-05', 4),
(u'2014-08-07', 2),
(u'2014-08-08', 1),
(u'2014-08-13', 1),
(u'2014-08-14', 3),
(u'2014-08-15', 1),
(u'2014-08-16', 6),
(u'2014-08-17', 1),
(u'2014-08-18', 1),
(u'2014-08-20', 1),
(u'2014-08-24', 1),
(u'2014-08-25', 3),
(u'2014-08-29', 1),
(u'2014-08-30', 1),
(u'2014-09-03', 3),
(u'2014-09-13', 1),
(u'2014-09-14', 1),
(u'2014-09-24', 3),
(u'2014-10-20', 1),
(u'2014-10-24', 1),
(u'2014-11-05', 3),
(u'2014-11-09', 1),
(u'2014-11-12', 1),
(u'2014-11-13', 1),
(u'2014-11-14', 1),
(u'2014-11-18', 1),
(u'2014-11-19', 4),
(u'2014-11-22', 1),
(u'2014-11-26', 3),
(u'2014-11-28', 3),
(u'2014-12-01', 2),
(u'2014-12-02', 2),
(u'2014-12-04', 2),
(u'2014-12-05', 1),
(u'2014-12-06', 5),
(u'2014-12-11', 1),
(u'2014-12-15', 10)]
Notice that there are date gaps in this data set, indicating that the missing dates have a value of 0.
My working (non-Pandas) version of code looks like this:
from datetime import datetime
from matplotlib import pyplot as plt
x_val = [datetime.strptime(x[0],'%Y-%m-%d') for x in date_cnts]
y_val = [x[1] for x in date_cnts]
plt.bar(x_val, y_val)
plt.grid(True)
plt.show()
This outputs this image:
Now, if I convert my query results to a pandas DataFrame
Date Count
0 2014-06-27 1
1 2014-06-29 3
2 2014-06-30 1
3 2014-07-01 1
4 2014-07-02 1
5 2014-07-09 1
6 2014-07-10 3
7 2014-07-11 1
8 2014-07-12 2
9 2014-07-14 1
10 2014-07-15 2
11 2014-07-17 3
12 2014-07-18 1
13 2014-07-20 1
14 2014-07-21 1
15 2014-07-23 2
16 2014-07-26 2
17 2014-07-27 2
18 2014-07-28 7
19 2014-07-29 3
20 2014-07-31 2
21 2014-08-01 1
22 2014-08-05 4
23 2014-08-07 2
24 2014-08-08 1
25 2014-08-13 1
26 2014-08-14 3
27 2014-08-15 1
28 2014-08-16 6
29 2014-08-17 1
30 2014-08-18 1
31 2014-08-20 1
32 2014-08-24 1
33 2014-08-25 3
34 2014-08-29 1
35 2014-08-30 1
36 2014-09-03 3
37 2014-09-13 1
38 2014-09-14 1
39 2014-09-24 3
40 2014-10-20 1
41 2014-10-24 1
42 2014-11-05 3
43 2014-11-09 1
44 2014-11-12 1
45 2014-11-13 1
46 2014-11-14 1
47 2014-11-18 1
48 2014-11-19 4
49 2014-11-22 1
50 2014-11-26 3
51 2014-11-28 3
52 2014-12-01 2
53 2014-12-02 2
54 2014-12-04 2
55 2014-12-05 1
56 2014-12-06 5
57 2014-12-11 1
58 2014-12-15 10
And use the simple pandas wrapper to plot this:
plt.figure()
df.plot(kind='bar', grid=True, legend=False, x='Date', y=u'Count')
plt.show()
I get this result. Notice that my missing days do not appear in this graph.
How do I re-add the gaps (and 0 values) where my dates do not exist in the DataFrame?
The reason I want to use pandas is to take advantage of some of its other features (most importantly, a rolling average).
I wrote a working version; it is probably not the best, but it does the job. It is based on reindexing your original data into a DataFrame with one sample for every day.
import pandas as pd
import matplotlib.pyplot as plt
#%% make data
df = pd.DataFrame(date_cnts)
df.columns = ['Date', 'Count']
#%% make dataframe with everyday sampling
df.index = pd.to_datetime(df['Date'])
startdate = df.index[0]
enddate = df.index[-1]
df_new = df.reindex(pd.date_range(startdate, enddate, freq='1D'))
#%% plot the results
df_new['Count'].plot(kind='bar')
# decrease number of days
new_xticks = plt.xticks()[0][1:-1:10]
plt.xticks(new_xticks)
For further formatting of the xticks I recommend this question: Pandas timeseries plot setting x-axis major and minor ticks and labels
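The question asks for 0 values on the missing days and mentions a rolling average. The following sketch shows one way to get both by reindexing only the Count values with fill_value=0. It uses just the first five entries of date_cnts to stay short, leaves plotting out, and the 3-day window is an arbitrary choice for illustration.

```python
import pandas as pd

# first five entries of date_cnts from the question (shortened for illustration)
date_cnts = [("2014-06-27", 1), ("2014-06-29", 3), ("2014-06-30", 1),
             ("2014-07-01", 1), ("2014-07-02", 1)]

# counts indexed by real datetimes rather than strings
counts = pd.Series([c for _, c in date_cnts],
                   index=pd.to_datetime([d for d, _ in date_cnts]),
                   name="Count")

# one entry per calendar day; days missing from the query get a count of 0
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
daily = counts.reindex(full_range, fill_value=0)

# rolling mean over 3 days (the window size is an assumption)
rolling = daily.rolling(window=3).mean()
print(daily)
print(rolling)
```

Calling daily.plot(kind='bar') on the result should then draw an empty slot for each zero-count day.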
