I have the following code which creates a table image with the column names labelled. The issue I am having is getting the columns (dc[x]) to populate the table vertically rather than horizontally.
import matplotlib.pyplot as plt
from pandas import DataFrame

def drilltable():
    c = readcsv3()
    dc = DataFrame(c)
    topA, ARA, PWA = dc[0], dc[1], dc[2]
    data = [topA, ARA, PWA]
    fig = plt.figure()
    # 'axisbg' was renamed to 'facecolor' in newer matplotlib
    ax = fig.add_subplot(111, facecolor='white')
    ax.axis('off')
    ax.set_aspect(.2)
    cols = ["Top Drillers in Alberta", "Active Rigs", "Prev Week"]
    table = ax.table(cellText=data, colLabels=cols, loc='upper center',
                     cellLoc='center', colWidths=[.075] * 18)
    # use the public cell dict rather than the private table._cells
    for cell in table.get_celld().values():
        cell.set_facecolor('#EEECE1')
    table.set_fontsize(20)
    table.scale(2.1, 16)
    plt.savefig(filenameTemplate3, format='png', bbox_inches='tight')
The image below is what I am currently getting: the column labels are correct, but the values are listed horizontally instead of below each column label.
Instead, I would like something that looks more like this (done in Excel, because I can't get it to work in Python):
The issue is that when I call ax.table(), cellText=data assigns the cells horizontally (row by row), whereas I would like them assigned vertically, since that is the format the data is in (a possible fix is sketched after the sample data below).
Any suggestions?
Here is an example of the data I am trying to read into the table via CSV file (the actual file is 11x12):
Top Drillers AB Active Rigs Prev. Week
Tourmaline Oil Corp 10 9
CNRL 8 8
Seven Generations 7 10
Encana Corp 6 7
Peyto Exploration 5 6
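Since ax.table() fills cellText row by row, one likely fix (a minimal sketch, untested against the real CSV) is to transpose the column lists so each original list becomes a table column:

# each entry of cell_text becomes one table row: (driller, active rigs, prev week)
cell_text = list(zip(topA, ARA, PWA))
table = ax.table(cellText=cell_text, colLabels=cols,
                 loc='upper center', cellLoc='center')

Equivalently, cellText=dc.values should work, since the rows of the DataFrame are already oriented that way.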
Shapefile data: the entire world (with 5 administrative levels), from https://gadm.org/data.html
import geopandas as gpd
World = gpd.read_file("~/gadm36.shp")
World = World[['NAME_0', 'NAME_1', 'NAME_2', 'geometry']]  # keep the three name columns plus geometry
World.head()
The full GeoDataFrame has 60 columns (NAME_0 for the country name, NAME_1 for the region, ...).
For now, I am interested in studying the number of users of my website in Germany:
Germany = World[World['NAME_0'] == 'Germany']
Here is my website user data by region (NAME_1); I renamed the first column to match the shapefile:
import pandas as pd
GER = pd.read_csv("~/GER.CSV", sep=";")
GER
Now I merge my data into the GeoDataFrame on NAME_1 to plot users by region:
merged_ger = Germany.merge(GER, on = 'NAME_1', how='left')
merged_ger['Users'] = merged_ger['Users'].fillna(0)
The problem here is that NAME_1 is repeated once for every NAME_2 it contains. Thus, the total number of users in the merged data greatly exceeds the original number:
print(merged_ger['Users'].sum())
print(GER['Users'].sum())
7172411.0
74529
So plotting the data using this code:
import matplotlib.pyplot as plt
merged_ger.plot(column='Users')
is obviously wrong
How can I merge the data in this case without duplication and without affecting the final plot?
Or, how do I ignore the rest of the administrative areas in a shapefile?
Wouldn't mapping a dictionary of users per region help?
GER_users = dict(zip(GER.NAME_1, GER.Users))         # region -> user count lookup
Germany['Users'] = Germany['NAME_1'].map(GER_users)  # one value per row, no join fan-out
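If one shape per region is acceptable, another option (a sketch, assuming the NAME_2 geometries can be merged) is to dissolve the subdivisions into their NAME_1 regions before joining, so the merge has nothing to duplicate:

# collapse the NAME_2 polygons into one polygon per NAME_1 region
regions = Germany.dissolve(by='NAME_1').reset_index()
merged_ger = regions.merge(GER, on='NAME_1', how='left')
merged_ger['Users'] = merged_ger['Users'].fillna(0)
merged_ger.plot(column='Users')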
I am sorry that I truly don't know what title I should use, but here is my question.
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of different stocks). In other words, I am trying to calculate the correlation of the corresponding rows of each DataFrame. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
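For reference, the two example frames can be reconstructed directly (a sketch, with the values copied from the tables above):

import pandas as pd

codes = ['000001.HR', '000002.ZH', '000004.HD', '000005.SS', '000006.SH', '000007.SH']
days = ['d-1', 'd-2', 'd-3', 'd-4']
Stocks_Open = pd.DataFrame(
    [[1817.670960, 1808.937405, 1796.928768, 1804.570628],
     [4867.910878, 4652.713598, 4652.713598, 4634.904168],
     [92.046474, 92.209029, 89.526880, 96.435445],
     [28.822245, 28.636893, 28.358865, 28.729569],
     [192.362963, 189.174626, 185.986290, 187.403328],
     [79.190528, 80.515892, 81.509916, 78.693516]],
    index=codes, columns=days)
Stocks_Volume = pd.DataFrame(
    [[324234, 345345, 657546, 234234],
     [4867343, 465234, 4652598, 4634168],
     [9246474, 929029, 826880, 965445],
     [2822245, 2836893, 2858865, 2829569],
     [19262963, 1897466, 1886290, 183328],
     [7190528, 803892, 809916, 7693516]],
    index=codes, columns=days)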
My attempt was to create a DataFrame and run a loop, assigning the results to it. But there is a problem: the index of the created DataFrame is not what I want. When I tried to append the correlation column, an error occurred. (Please ignore the correlation values, which I concocted here just to give an example.)
r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i, :] = Stocks_Open.iloc[i].corr(Stocks_Volume.iloc[i])

Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i, 8] = r.iloc[i, :]
r
         c
1    0.654
2   -0.454
3   0.3321
4   0.2166
5  -0.8772
6   0.3256
This raises the error:
"ValueError: Incompatible indexer with Series"
I realized that my correlation DataFrame's index is an integer range rather than the stock codes, but I don't know how to fix it. Can anyone help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
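Alternatively (a sketch, assuming both frames share the same row labels and day columns), pandas can compute all the row-wise correlations in one call with corrwith:

# pairwise correlation of corresponding rows, indexed by stock code
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1)
result = corr.to_frame('corr')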
I have just started out with Pandas and I am trying to do multilevel sorting of data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME, CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP); after sorting, it appears as in pic 2. You can see the indices are messy and no longer sorted serially.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and no longer works as written. I have yet to learn the groupby function.
Could you please tell me a way I can achieve this?
P.S. I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com.
I used the code below to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and the code below to sort the data:
census_dfq6 = census_dfq6.sort_values (by = ['CENSUS2010POP'], ascending = [False] )
Here is a sample of the data I am working with. I would love to share the CSV file, but I don't see a way to do so:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
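A minimal sketch of one way to get there, assuming the goal is states in alphabetical order with the counties ordered inside each state: sort by both keys in a single sort_values call before setting the index, so the MultiIndex stays grouped by state:

census_dfq6 = (census_dfq6
               .sort_values(['STNAME', 'CENSUS2010POP'], ascending=[True, False])
               .set_index(['STNAME', 'CTYNAME']))

This keeps each state's counties together while ordering them by CENSUS2010POP, descending, within the state.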
First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows 'city_number', 'city' (which is a name), and also the number of cafes in the same city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I have tried to use groupby and the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']] * 5,
          [['bean_dream', 'boston', '3456']] * 4,
          [['coffee_today', 'jersey', '7643']] * 3,
          [['coffee_today', 'DC', '8902']] * 3,
          [['starbucks', 'nowwhere', '2674']] * 2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
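For the exact three columns the question asks about, a shorter route (a sketch, assuming cafe_df_merged has one row per cafe) is to group on both city columns and count the rows:

city_directory = (cafe_df_merged
                  .groupby(['city_number', 'city'])
                  .size()
                  .reset_index(name='number_of_cafe'))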
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
I have a log file which I need to plot in Python, with different data points as a multi-line plot and a line for each unique point. The problem is that in some samples some points are missing and new points are added in others, as shown in the example below, with each line denoting a sample of n points, where n is variable:
2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2),('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20), ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]
In the first sample above, 'mouse' is present but absent from the second line, and each sample adds new data points like 'cod' and 'fish'.
So how can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help plot this timed log file? Also, being a log file, the samples number in the thousands, so the visualization should be able to display them properly.
I am interested in applying multivariate hexagonal binning to this, with a different colored hexagon for each unique column ('ball', 'mouse', etc.). scikit offers hexagonal binning, but I can't figure out how to render different colors for each hexagon based on the unique data point. Any other visualization technique would also help.
Getting the data into pandas:
import pandas as pd
from ast import literal_eval  # safer than eval for parsing the list literal

rows = []
with open(logfilepath) as f:
    for line in f:
        timestamp = line.split(',')[0]
        # the data part of each line can be parsed directly as a Python list
        data = literal_eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            rows.append({'timestamp': timestamp, 'name': name, 'value': value})

# build the frame once; appending row by row inside the loop is much slower
df = pd.DataFrame(rows, columns=['timestamp', 'name', 'value'])

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index='timestamp', columns='name')
df2 = df2.fillna(0)
df2
Out[142]:
value
name ball cod fish keypassed mouse toy
timestamp
2015-06-20 16:42:48 2 0 0 13 1 2
2015-06-21 16:42:48 7 1 1 20 0 5
Plot the data:
import matplotlib.pyplot as plt
df2.value.plot()
plt.show()
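For the hexagonal-binning idea in the question, matplotlib's hexbin (rather than scikit) can be layered once per series, each layer with its own colormap. A rough sketch, assuming df is the long-format frame built above and there are at most a handful of distinct names:

import matplotlib.pyplot as plt
from matplotlib.dates import date2num
import pandas as pd

df['t'] = date2num(pd.to_datetime(df['timestamp']))  # numeric time axis for binning
fig, ax = plt.subplots()
cmaps = ['Blues', 'Greens', 'Reds', 'Purples', 'Oranges', 'Greys']
for cmap, (name, group) in zip(cmaps, df.groupby('name')):
    # one hexbin layer per unique name; zip stops after len(cmaps) names
    ax.hexbin(group['t'], group['value'], gridsize=25, cmap=cmap, mincnt=1)
plt.show()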