I have a problem creating a pandas DataFrame with a MultiIndex. In the data below, you will see that it is data for 2 banks, where each bank has 2 assets and each asset has 3 features.
My data is structured similarly, and I want to create a DataFrame out of it.
Data = [[[2,4,5],[3,4,5]],[[6,7,8],[9,10,11]]]
Banks = ['Bank1', 'Bank2']
Assets = ['Asset1', 'Asset2']
Asset_feature = ['private','public','classified']
I have tried various ways to do this, but I have always failed to create an accurate DataFrame. The result should look something like this:
      Asset1                     Asset2
      private public classified private public classified
Bank1       2      4          5       3      4          5
Bank2       6      7          8       9     10         11
Any help would be much appreciated.
import pandas as pd
import numpy as np
assets = ['Asset1', 'Asset2']
Asset_feature = ['private', 'public', 'classified']
Banks = ['Bank1', 'Bank2']
Data = [[[2,4,5],[3,4,5]],[[6,7,8],[9,10,11]]]
# flatten to one row per bank and one column per (asset, feature) pair
Data = np.array(Data).reshape(len(Banks), len(assets) * len(Asset_feature))
# two-level column index: assets on the outer level, features underneath
midx = pd.MultiIndex.from_product([assets, Asset_feature])
test = pd.DataFrame(Data, index=Banks, columns=midx)
test
which gives this output
      Asset1                     Asset2
      private public classified private public classified
Bank1       2      4          5       3      4          5
Bank2       6      7          8       9     10         11
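As a small variation of the same construction, you can name the levels so the frame is self-describing (the level names 'Asset', 'Feature', and 'Bank' below are my own labels, not part of the original data):
midx = pd.MultiIndex.from_product([assets, Asset_feature], names=['Asset', 'Feature'])
test = pd.DataFrame(Data, index=pd.Index(Banks, name='Bank'), columns=midx)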
For each group in the data frame df_task that contains three rows, I would like to modify the second row of the column Task.
import pandas as pd
df_task = pd.DataFrame({'Days':[5,5,5,20,20,20,10,10],
'Task':['Programing','Presentation','Training','Development','Presentation','Workshop','Coding','Communication']},)
df_task.groupby(["Days"])
This is the expected output: if a group contains three rows, the value of Task from the first row is appended to the value of Task in the second row, as shown in the new column New_Task; if a group has only two rows, nothing is modified:
   Days           Task                  New_Task
0     5     Programing                Programing
1     5   Presentation   Presentation,Programing
2     5       Training                  Training
3    20    Development               Development
4    20   Presentation  Presentation,Development
5    20       Workshop                  Workshop
6    10         Coding                    Coding
7    10  Communication             Communication
Your requirement is pretty straightforward. Try:
import numpy as np

groups = df_task.groupby('Days')
# enumerate the rows within each group (0, 1, 2, ...)
enums = groups.cumcount()
# size of each group, broadcast to every row
sizes = groups['Task'].transform('size')
# update only the second row of each group that has more than two rows
df_task['New_Task'] = np.where(enums.eq(1) & sizes.gt(2),
                               df_task['Task'] + ',' + groups['Task'].shift(fill_value=''),
                               df_task['Task'])
print(df_task)
Output:
   Days           Task                  New_Task
0     5     Programing                Programing
1     5   Presentation   Presentation,Programing
2     5       Training                  Training
3    20    Development               Development
4    20   Presentation  Presentation,Development
5    20       Workshop                  Workshop
6    10         Coding                    Coding
7    10  Communication             Communication
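The shift is what carries the first row's Task onto the second row: within each Days group, groups['Task'].shift(fill_value='') aligns every row with the row before it. If you prefer an explicit boolean mask over np.where, this sketch of the same logic should behave identically:
mask = enums.eq(1) & sizes.gt(2)
df_task['New_Task'] = df_task['Task']
# only the masked rows get the previous row's Task appended via the group-wise shift
df_task.loc[mask, 'New_Task'] = df_task['Task'] + ',' + groups['Task'].shift(fill_value='')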
I have the following dataframe:
ID     mutex add  atomic add  cas add  ys_add  blocking ticket  queued fifo
Cores
1           21.0         7.1     12.1     9.8             32.2         44.6
2          121.8        40.0    119.2   928.7           7329.9       7460.1
3          160.5        81.5    227.9  1640.9          14371.8      11802.1
4          188.9       115.7    347.6  1945.1          29130.5      15660.1
There is both a column index (ID) and a row index (Cores). When I use DataFrame.to_html(), I get a table like this:
Instead, I'd like a table with a single header row, composed of all the column names (but without the column index name ID) and with the row index name Cores in that same header row, like so:
I'm open to manipulating the dataframe prior to the to_html() call, or adding parameters to the to_html() call, but not messing around with the generated html.
Initial setup:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],
columns = ['attr_a', 'attr_b', 'attr_c', 'attr_c'])
df.columns.name = 'ID'
df.index.name = 'Cores'
df
ID     attr_a  attr_b  attr_c  attr_c
Cores
0           1       2       3       4
1           5       6       7       8
2           9      10      11      12
3          13      14      15      16
Then set columns.name to 'Cores', and index.name to None. df.to_html() should then give you the output you want.
df.columns.name='Cores'
df.index.name = None
df.to_html()
Cores  attr_a  attr_b  attr_c  attr_c
0           1       2       3       4
1           5       6       7       8
2           9      10      11      12
3          13      14      15      16
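If you would rather leave df untouched, you can apply the same two assignments to a copy first (html_df is just an illustrative name):
html_df = df.copy()
html_df.columns.name = 'Cores'   # move the index name into the header row
html_df.index.name = None        # drop it from the row index
html = html_df.to_html()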
I have been trying to obtain the Statewise sheet from this public Google Sheets link as a pandas DataFrame.
The URL of this sheet differs from the URLs in other examples on this site that show how to get a sheet as a DataFrame.
URL is this :
https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml#
One standard way may be the following:
import pandas
googleSheetId = '<Google Sheets Id>'
worksheetName = '<Sheet Name>'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
googleSheetId,
worksheetName
)
df = pandas.read_csv(URL)
print(df)
But the present URL does not follow the structure used here. Can someone help clarify? Thanks.
The published Google spreadsheet is actually an HTML page (note the pubhtml at the end of the URL). So you should use read_html to load it into a list of pandas DataFrames:
dfs = pd.read_html(url, encoding='utf8')
if lxml is available or, if you use BeautifulSoup4:
dfs = pd.read_html(url, flavor='bs4', encoding='utf8')
You will get a list of DataFrames; for example, dfs[0] is:
      0   1                                                  2                3
0     1  id                                             Banner  Number_Of_Times
1     2   1  Don't Hoard groceries and essentials. Please e...                2
2     3   2  Be compassionate! Help those in need like the ...                2
3     4   3  Be considerate : While buying essentials remem...                2
4     5   4  Going out to buy essentials? Social Distancing...                2
5     6   5  Plan ahead! Take a minute and check how much y...                2
6     7   6  Plan and calculate your essential needs for th...                2
7     8   7  Help out the elderly by bringing them their gr...                2
8     9   8  Help out your workers and domestic help by not...                2
9    10   9  Lockdown means LOCKDOWN! Avoid going out unles...                1
10   11  10           Panic mode : OFF! ❌ESSENTIALS ARE ON! ✔️                1
11   12  11  Do not panic! ❌ Your essential needs will be t...                1
12   13  12  Be a true Indian. Show compassion. Be consider...                1
13   14  13  If you have symptoms and suspect you have coro...                1
14   15  14  Stand Against FAKE News and WhatsApp Forwards!...                1
15   16  15  If you have any queries, Reach out to your dis...                1
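Since read_html returns every table on the page (one per published tab), you still need to locate the Statewise sheet. A simple way, with no assumption about its position in the list, is to skim the frames first:
for i, d in enumerate(dfs):
    print(i, d.shape)   # skim the sizes, then eyeball candidates, e.g. dfs[3].head()
# note: the index 3 above is only a placeholder, not the actual position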
You can use the following snippet; note that URL here must point to a CSV export of the sheet, not the pubhtml page:
from io import BytesIO

import pandas as pd
import requests

r = requests.get(URL)
data = r.content
# error_bad_lines is deprecated in newer pandas; on_bad_lines='skip' is its replacement
df = pd.read_csv(BytesIO(data), index_col=0, error_bad_lines=False)
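Alternatively, for a link published the way the one above is, a CSV endpoint can often be derived by swapping pubhtml for pub with output=csv. This is a sketch that depends on how the sheet was published, and gid=0 is a guess at the tab id:
import pandas as pd

base = ('https://docs.google.com/spreadsheets/d/e/'
        '2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd')
csv_url = base + '/pub?gid=0&single=true&output=csv'   # gid selects the worksheet
df = pd.read_csv(csv_url)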
I have a DataFrame very similar to this one, but with thousands of values:
import numpy as np
import pandas as pd
# Setup fake data.
np.random.seed([3, 1415])
df = pd.DataFrame({
'Class': list('AAAAAAAAAABBBBBBBBBB'),
'type': (['short']*5 + ['long']*5) *2,
'image name': (['image01']*2 + ['image02']*2)*5,
'Value2': np.random.random(20)})
I was able to find a way to randomly sample 2 values per image, per Class, and per type with the following code:
df2 = df.groupby(['type', 'Class', 'image name'])[['Value2']].apply(lambda s: s.sample(min(len(s),2)))
I got the following result:
I'm looking for a way to subset that table so that I can randomly choose one image ('image name') per type and per Class (and keep the 2 values for the randomly selected image).
Excel example of my desired output:
IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it.
You can first create the groupby object:
gb = df.groupby(['type', 'Class'])
Now you can iterate over the groupby blocks using a list comprehension:
blocks = [data.sample(n=1) for _,data in gb]
Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:
pd.concat(blocks)
Output
    Class    Value2 image name   type
7       A  0.817744    image02   long
17      B  0.199844    image01   long
4       A  0.462691    image01  short
11      B  0.831104    image02  short
OR
You can modify your code and add the column image name to the groupby, like this:
df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))
                  Value2 image name
type  Class
long  A     8   0.777962    image01
            9   0.757983    image01
      B     19  0.100702    image02
            15  0.117642    image02
short A     3   0.465239    image02
            2   0.460148    image02
      B     10  0.934829    image02
            11  0.831104    image02
EDIT: Keeping the same image per group
I'm not sure you can avoid an iterative process for this problem; note that the variants above sample rows without guaranteeing they come from a single image. You could loop over the groupby blocks, filter each group down to one randomly chosen image (the same image name within the group), and then sample two rows from it, like this:
import random

gb = df.groupby(['Class', 'type'])
ls = []
for index, frame in gb:
    # choose one image at random, keep only its rows, then sample two of them
    image = random.choice(frame['image name'].unique())
    ls.append(frame[frame['image name'] == image].sample(n=2))
pd.concat(ls)
Output
    Class    Value2 image name   type
6       A  0.850445    image02   long
7       A  0.817744    image02   long
4       A  0.462691    image01  short
0       A  0.444939    image01  short
19      B  0.100702    image02   long
15      B  0.117642    image02   long
10      B  0.934829    image02  short
14      B  0.721535    image02  short
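If you want to avoid the explicit for loop, a loop-free sketch of the same idea (assuming every image has at least two rows within its Class/type group) could look like this:
import numpy as np

# pick one random image per (Class, type) ...
picked = (df.groupby(['Class', 'type'])['image name']
            .apply(lambda s: np.random.choice(s.unique()))
            .reset_index())
# ... keep only that image's rows, then sample two of them per group
result = (df.merge(picked, on=['Class', 'type', 'image name'])
            .groupby(['Class', 'type'], group_keys=False)
            .apply(lambda g: g.sample(n=2)))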
A data example (not real data) can also be seen here. I have a 3x500 data set with column names Job Level (numerical), Job Code (categorical), and Stock Value (numerical). I am using linear regression to fit the Stock Values based on Job Levels, grouped by Job Code.
For example:
Job Code  Job Level            Job Title  Stock Value
      20          1  Production Engineer
      20          2  Production Engineer
      20          3  Production Engineer        6,985
      20          4  Production Engineer        7,852
      20          5  Production Engineer
      30          1  Production Engineer
      30          2    Logistics Analyst
      30          3    Logistics Analyst        4,962
      30          4    Logistics Analyst       22,613
      30          5    Logistics Analyst       31,689
      40          1    Logistics Analyst
Here is what I have done. How can I see my data set columns (the original data) with the predicted values added? Right now I can only see the predictions. I cannot join them together, because:
Here is the situation: when I first start my code, df_nonull.shape = (268, 4); after the for loop, df_nonull.shape = (4, 4); and then df_results.shape = (89, 2). As a result, I am not able to join them.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")
df_nonull = df.dropna()

model = LinearRegression()
groups = []
results = []
level = []

for (group, df_nonull) in df_nonull.groupby('Job Code'):
    X = df_nonull[['Job Level']]
    y = df_nonull[['Stock Value']]
    model.fit(X, y)
    coefs = list(zip(X.columns, model.coef_))
    results.append(model.predict(735947)[0])
    groups.append(group)

df_results = pd.DataFrame({'Job Code': groups, 'prediction': results})

print df_results.head(50)
Just FYI, my main goal here is running a regression model on the subset of the data without NaNs (df_nonull) and applying the linear regression coefficients to the entire data set (for Stock Value, y) (df). This has nothing to do with what I am asking, but I wanted to give some background on why I am pursuing this.
Assuming you have a consistent index for the input data and the prediction series, I think what you need is pd.concat.
>>> import pandas as pd
>>> X = pd.DataFrame({'input': [i for i in range(10)]})  # fake input data
>>> pred = pd.DataFrame({'prediction': [i - 5 for i in range(10)]})  # fake prediction data
>>> pd.concat([X, pred], axis=1)
   input  prediction
0      0          -5
1      1          -4
2      2          -3
3      3          -2
4      4          -1
5      5           0
6      6           1
7      7           2
8      8           3
9      9           4
I would recommend the pandas (0.20.1) documentation, specifically the section on concatenation.
You can then use the following command to create a single DataFrame containing the data set values and the predicted values:
df_nonull.join(df_results,how="outer")
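For the broader goal stated in the question (fit per Job Code on the non-null rows, then attach predictions to every row of the original df), the sketch below sidesteps the shape problem; note that reusing df_nonull as the loop variable in the question's code is what shrinks it to the last group's shape. Column names follow the question; everything else, including the assumption that Job Level is never missing, is mine:
import pandas as pd
from sklearn.linear_model import LinearRegression

preds = []
for code, g in df.dropna(subset=['Stock Value']).groupby('Job Code'):
    model = LinearRegression().fit(g[['Job Level']], g['Stock Value'])
    rows = df[df['Job Code'] == code]   # every row for this code, NaN targets included
    preds.append(pd.Series(model.predict(rows[['Job Level']]), index=rows.index))
df['prediction'] = pd.concat(preds)     # aligns with df by index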