Creating a pivot table for a difficult dataset in pandas - python

Please help me understand how I can create a pivot_table or groupby for a difficult dataset.
I tried to create a pivot_table:
grouped_table = pd.pivot_table(renamedDf,index=["date","date_1","date_2","date_3", values = col_list] ,aggfunc=np.sum)
I received:
File "<ipython-input-107-c87c2a9a3325>", line 1
grouped_table = pd.pivot_table(renamedDf,index=["date","date_1","date_2","date_3", values = col_list] ,aggfunc=np.sum)
^SyntaxError: invalid syntax
Dataset has the following structure:
[screenshot of the dataset]
Expected structure:
[screenshot of the expected structure]
Thank you in advance for any suggestions!

You have a syntax error because you didn't close the bracket after the index list; the values argument belongs outside it, like this:
grouped_table = pd.pivot_table(renamedDf,index=["date","date_1","date_2","date_3"], values = col_list ,aggfunc=np.sum)
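For reference, here is a minimal sketch of the corrected call on made-up data; the column names and values below are assumptions, not the asker's real dataset:

```python
import pandas as pd

# Hypothetical frame mimicking the question's layout; names and values are assumptions.
renamedDf = pd.DataFrame({
    "date":   ["2020-01", "2020-01", "2020-02"],
    "date_1": ["a", "a", "b"],
    "date_2": ["x", "x", "y"],
    "date_3": ["p", "p", "q"],
    "val_1":  [1, 2, 3],
    "val_2":  [10, 20, 30],
})
col_list = ["val_1", "val_2"]

# values= sits outside the index list; the string "sum" avoids the
# deprecation warning newer pandas emits for aggfunc=np.sum.
grouped_table = pd.pivot_table(
    renamedDf,
    index=["date", "date_1", "date_2", "date_3"],
    values=col_list,
    aggfunc="sum",
)
print(grouped_table)
```

The two rows sharing the same four index keys are summed into one row of the result.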


Facing some problems in groupby function for Outlier Removal

I am working on a data-cleaning project in which I have to remove outliers of price_per_sqft. I used groupby, built a statistical formula that creates a data frame without outliers, and concatenated it with the output data frame.
But the output returns the location names with extra words, so how can I get a clean location name instead?
Code:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

df6 = remove_pps_outliers(df5)
df6.head()
Output:
[screenshot of the output]
How can I get the answer without the "1st Phase" or "1st Block" prefixes, like this:
[screenshot of the expected output]
A rudimentary fix would be to just replace the characters you do not want. Luckily, in this example both '1st Phase ' and '1st Block ' contain 10 characters, so you could use:
df6['location'] = df6['location'].str.slice_replace(0,10,'')
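As a quick illustration, here is a sketch of that fix on invented sample values (the location strings are assumptions based on the screenshots):

```python
import pandas as pd

# Invented sample mimicking the 'location' column from the screenshots.
df6 = pd.DataFrame({"location": ["1st Phase JP Nagar", "1st Block Jayanagar"]})

# Replace characters 0-9 (the 10-character prefix) with an empty string.
df6["location"] = df6["location"].str.slice_replace(0, 10, "")
print(df6["location"].tolist())  # ['JP Nagar', 'Jayanagar']
```

Note this only works while every unwanted prefix is exactly 10 characters; for mixed prefixes, a pattern-based Series.str.replace would be the more general tool.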

KeyError: "['C18orf17', 'UHRF1', 'OLR1', 'TBC1D2', 'AXUD1'] not in index"

I have seen this error discussed here, but my problem is not that.
I am trying to extract some columns of a large dataframe:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
It throws an error as follows:
KeyError: "['C18orf17', 'UHRF1', 'OLR1', 'TBC1D2', 'AXUD1'] not in index"
After removing the above columns, it started working fine:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", "TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
But I want to avoid this error by skipping the column names that are not present and keeping only those that overlap. Any help appreciated.
Use Index.intersection to select only the columns from the list that actually exist:
L = ["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]
dfx = df1[df1.columns.intersection(L, sort=False)]
Or filter them with Index.isin; then you need DataFrame.loc with a first : to select all rows, and the mask to select the columns:
dfx = df1.loc[:, df1.columns.isin(L)]
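A minimal sketch of both options on toy data; the column names come from the question, but the frame and values are invented:

```python
import pandas as pd

# Toy frame in which one requested column ('UHRF1') does not exist.
df1 = pd.DataFrame({"THRSP": [1], "SORD": [2], "ANKRD22": [3]})
L = ["THRSP", "UHRF1", "SORD", "ANKRD22"]

# Option 1: keep only the intersection of the list and the real columns.
dfx = df1[df1.columns.intersection(L, sort=False)]
print(list(dfx.columns))  # ['THRSP', 'SORD', 'ANKRD22']

# Option 2: boolean mask over the columns; ':' selects all rows.
dfx2 = df1.loc[:, df1.columns.isin(L)]
```

Both variants silently ignore the missing 'UHRF1' instead of raising a KeyError.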

Need advice on merging dataframes

I have 2 data frames, as below. I would like to merge the two dataframes and display the match schedule with the match ID and team names.
df_team = pd.DataFrame({'Team_ID':['1','2','3','4'],'Team_Name':['CSK','KKR','MI','RCB']})
df_match = pd.DataFrame({'Match_id':['01','02','03','04'],'Team_ID':['1','2','3','4'],'Opponent_Team_Id':['2','3','4','1']})
display(df_team)
display(df_match)
Output
I got the expected output using the below code. However, I would like to know whether this could be simplified further. Please advise. Thanks in advance.
Merge to get Home team's name
df_match_schedule = pd.merge(df_match,df_team,left_on='Team_ID',right_on='Team_ID')
df_match_schedule.rename(columns = {'Team_Name':'Home Team'}, inplace = True)
Merge to get Away team's name
df_match_schedule_2 = pd.merge(df_match_schedule,df_team,left_on='Opponent_Team_Id',right_on='Team_ID').drop('Team_ID_y',axis=1)
df_match_schedule_2.rename(columns = {'Team_Name':'Away Team'}, inplace = True)
Display the match schedule
df_match_schedule_2[['Match_id','Home Team','Away Team']]
Final Output
This maps the team IDs onto the names:
df_team = df_team.set_index("Team_ID")
df_match.loc[:,"Team_ID"] = df_match["Team_ID"].apply(lambda i: df_team.loc[i,"Team_Name"])
df_match.loc[:,"Opponent_Team_Id"] = df_match["Opponent_Team_Id"].apply(lambda i: df_team.loc[i,"Team_Name"])
(On a side note, your provided code does not work; the second merge is the problem.)
Also, when merging, if both dataframes have the same key column name you can use the keyword "on" instead of "right_on" and "left_on".
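As a sketch of that mapping idea, Series.map with a Team_ID-to-name lookup avoids the double merge entirely (using the question's own sample frames):

```python
import pandas as pd

df_team = pd.DataFrame({'Team_ID': ['1', '2', '3', '4'],
                        'Team_Name': ['CSK', 'KKR', 'MI', 'RCB']})
df_match = pd.DataFrame({'Match_id': ['01', '02', '03', '04'],
                         'Team_ID': ['1', '2', '3', '4'],
                         'Opponent_Team_Id': ['2', '3', '4', '1']})

# One lookup Series (Team_ID -> Team_Name), mapped onto both key columns.
name_by_id = df_team.set_index('Team_ID')['Team_Name']
schedule = pd.DataFrame({
    'Match_id':  df_match['Match_id'],
    'Home Team': df_match['Team_ID'].map(name_by_id),
    'Away Team': df_match['Opponent_Team_Id'].map(name_by_id),
})
print(schedule)
```

This yields the same three-column schedule as the two-merge approach, without the renames or the dropped suffix columns.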

Fetching 2 columns from Tuple using Python

I have a dataframe; when I iterate through its rows as tuples, it looks like this:
for row in df.itertuples(index=False, name=None):
    print(row)
Output:
(100214, '120.6843686', '-41.9098438')
(101105, '121.7692179', '-42.2737880')
(101847, '122.6417215', '-43.8718865')
Output Desired:
('120.6843686', '-41.9098438')
('121.7692179', '-42.2737880')
('122.6417215', '-43.8718865')
I am new to Python, so any help would really be appreciated.
Thanks..
Use the following code:
for row in df.itertuples(index=False, name=None):
    print(row[1:])
This slices the tuple and displays everything after column 0. This article explains it in further detail if you're interested.
If you are just trying to get values here's a simple way:
import pandas as pd
df = pd.DataFrame((
    (100214, '120.6843686', '-41.9098438'),
    (101105, '121.7692179', '-42.2737880'),
    (101847, '122.6417215', '-43.8718865'),
))
df = df.iloc[:, 1:].values.tolist()
print(df)
[['120.6843686', '-41.9098438'],
['121.7692179', '-42.2737880'],
['122.6417215', '-43.8718865']]

removing rows with given criteria

I am a beginner with both Python and pandas, and I came across an issue I can't handle on my own.
What I am trying to do is:
1) remove all the columns except the three I am interested in;
2) remove all rows whose "asset number" column contains several strings. And here is the difficult part: I removed all the blanks, but I can't remove the other ones because nothing happens (for example with the string "TECHNOLOGIES": I tried part of the word and the whole word, and neither works).
Here is the code:
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19')
df = df[['asset number','Cost','accumulated depr']] #removing other columns
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace = False)
df = df[~df['asset number'].str.contains("TECHNOLOGIES, INC", na=False)]
df.to_excel("abi_output.xlsx")
Besides that, the file has 600k rows and loads so slowly that it takes ages to see the output. Do you have any advice for that?
Thank you!
#Kenan - thank you for your answer. Now the code looks like below, but it still doesn't remove the rows whose chosen column contains the specified strings. I also attached a screenshot of the output to show you that the rows still exist. Any thoughts?
import modin.pandas as pd
File1 = 'abi.xlsx'
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
several_strings = ['', 'TECHNOLOGIES', 'COST CENTER', 'Account', '/16']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
[screenshot: the rows are still not deleted]
#Andy
I attach a sample of the input file. I just changed the numbers in two columns, because these are confidential, and removed the unneeded columns (removing them with code wasn't a problem).
Here is the link; let me know if it is not working properly.
[link to the sample file]
You can combine your first two steps with:
df = pd.read_excel(File1, sheet_name = 'US JERL Dec-19', usecols=['asset number','Cost','accumulated depr'])
I assume this is what you're trying to remove:
several_strings = ['TECHNOLOGIES, INC', 'blah', 'blah']
df = df[~df['asset number'].isin(several_strings)]
df.to_excel("abi_output.xlsx")
Update
Based on the link you provided, this might be a better approach:
df = df[df['asset number'].str.len().eq(7)]
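A sketch of that length-based filter on invented values; the assumption, taken from the linked sample, is that real asset numbers are exactly 7 characters long:

```python
import pandas as pd

# Invented 'asset number' values: 7-character codes are real assets, the rest
# are header/footer text that should be dropped.
df = pd.DataFrame({'asset number': ['1234567', 'TECHNOLOGIES, INC',
                                    '7654321', 'COST CENTER']})

# Keep only the rows whose string length is exactly 7.
df = df[df['asset number'].str.len().eq(7)]
print(df['asset number'].tolist())  # ['1234567', '7654321']
```

This sidesteps listing every unwanted string, at the cost of assuming a fixed code length.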
The code you've given is correct, so I guess maybe there is something wrong with the strings in your 'asset number' column. Can you give some examples for a code check?
