Pandas dataframe sorting issue - python
I have an Excel spreadsheet that I create a DataFrame from. When I run my code, I can't get the DataFrame to sort correctly on port_count. I'm trying to make it sort on port_count and then display the ports that are open for each IP address. This code is almost there, but the sorting is giving me a problem.
import pandas as pd
import openpyxl as xl
data = {'IP': ['192.168.1.1','192.168.1.1','192.168.1.1','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','192.168.5.3','192.168.5.3','192.168.4.6','192.168.4.6','192.168.4.7','192.168.4.7','192.168.8.9','192.168.8.9','10.10.2.3','10.10.2.3','10.5.2.3','10.5.2.3','10.1.2.3','10.1.2.3','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','4.5.6.7','4.5.6.7','4.5.6.7','4.5.6.7','4.5.6.7','192.168.9.10','192.168.9.10','192.168.9.11','192.168.9.11','192.168.9.12','192.168.9.12','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','8.9.10.11','8.9.10.11','8.9.10.11','2.8.3.9','2.8.3.9','12.13.14.15','13.14.15.16','13.14.15.16','74.208.236.41','74.208.236.41','74.208.236.41','3.234.139.2','3.234.139.2','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','1.2.3.6','192.168.9.6','192.168.9.6','172.16.54.65','172.16.54.65','172.16.54.65','172.16.54.65','172.16.54.66','172.16.54.66','172.16.54.66','172.16.85.36','172.16.85.36','10.10.12.12','10.10.12.12'],
'Port': ['22','80','443','80','443','2082','2083','2086','2087','2095','8080','8443','8880','80','443','80','443','80','443','80','443','80','443','80','443','80','443','21','22','25','80','110','143','443','465','587','993','995','2082','2086','2087','2096','3306','25','80','443','465','587','80','443','80','443','80','443','80','443','2052','2053','2082','2083','2086','2087','2096','8080','8443','8880','5222','8008','8443','80','443','80','80','443','80','81','443','80','443','80','443','2082','2083','2087','8443','8880','80','443','2052','2053','2082','2083','2086','2087','2096','8080','8443','8880','80','80','443','80','82','83','443','80','82','443','80','443','80','443'],
}
df = pd.DataFrame(data)
df['port_count'] = df.groupby('IP')['Port'].transform('count')
df['port_count'] = df['port_count'].astype(int)
df.sort_values(by=['port_count'], ascending=False, inplace=True)
pivot1 = df.pivot_table(df, index=['IP', 'Port'], columns=None, fill_value=0).sort_values(by='port_count', ascending=False)
if df.size != 0:
    with pd.ExcelWriter("/testing/test.xlsx", mode="a", engine="openpyxl", if_sheet_exists='replace') as writer:
        pivot1.to_excel(writer, sheet_name="IP to Port")
Current output looks like this:
https://www.hopticalillusion.co/shared-files/730/?test_output.xlsx
Desired Output:
https://www.hopticalillusion.co/shared-files/731/?desired_test_output.xlsx
Maybe try the following:
df['port_count'] = df['port_count'].astype(int)
df.sort_values(by=['port_count'], ascending=False, inplace=True)
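The likely culprit is not the sort itself: pivot_table sorts its result by the index by default, which discards the earlier sort_values. On pandas 1.3+ you can pass sort=False so the pivot keeps the pre-sorted row order. A minimal sketch with a small made-up frame (the IPs and ports here are placeholders, not the question's data):

```python
import pandas as pd

# Small made-up frame standing in for the question's data.
data = {
    'IP': ['1.2.3.4', '1.2.3.4', '10.0.0.1', '10.0.0.1', '10.0.0.1'],
    'Port': ['80', '443', '22', '80', '443'],
}
df = pd.DataFrame(data)
df['port_count'] = df.groupby('IP')['Port'].transform('count')

# pivot_table re-sorts by its index by default, which throws away any
# earlier sort_values; sort=False (pandas >= 1.3) keeps the rows in
# the order they appear in the pre-sorted frame.
df = df.sort_values('port_count', ascending=False)
pivot1 = df.pivot_table(index=['IP', 'Port'], values='port_count',
                        aggfunc='first', sort=False)
print(pivot1)
```

Sorting first and then pivoting with sort=False keeps the IPs with the highest port_count at the top, which looks like the desired output.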
Related
Pandas dataframe throwing error when appending to CSV
import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df[sector_select] = ["100"]
df.to_csv("stack.csv", index=False, mode='a', header=False)

stack.csv has no data other than a header: Col1,Col2,Col3,Col4,Col5

ValueError: Length of values (1) does not match length of index (2)

I'm just trying to make a program where I can select a header and append data to the column under that header. You can only run it twice until it gives an error!
You can use this: df = df.append({"Col2": 100}, ignore_index=True)
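A caveat if you are on a newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so the same idea is now written with pd.concat. A small sketch (the column names just mirror the question's header row):

```python
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat is the
# replacement. Start from an empty frame with the question's headers.
df = pd.DataFrame(columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

# Build the new row as a one-row DataFrame and concatenate it on,
# renumbering the index so rows never collide.
new_row = pd.DataFrame([{'Col2': 100}])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```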
That code runs for me. But I assume that you would like to run something like this:

import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df.at[len(df), sector_select] = "100"
df.to_csv("stack.csv", index=False)
AttributeError: 'ExcelFile' object has no attribute 'dropna'
I was trying to remove the empty column in an Excel file using pandas' dropna() method, but I ended up with the above error message. Please find my code below:

import pandas as pd
df = pd.ExcelFile("1.xlsx")
print(df.sheet_names)
#df.dropna(how='all', axis=1)
newdf = df.dropna()
Please provide more code and context, but the root cause is that pd.ExcelFile returns an ExcelFile object (a handle to the workbook), not a DataFrame, so it has no dropna method. Read the sheet into a DataFrame with pd.read_excel first. This might help:

import pandas as pd

excel_file_name = 'insert excel file path'
excel_sheet_name = 'insert sheet name'

# create dataframe from desired excel file
df = pd.read_excel(
    excel_file_name,
    engine='openpyxl',
    sheet_name=excel_sheet_name
)

# drop columns with NaN values and write that into df
#
# without the inplace option it would have to be
# < df = df.dropna(axis=1) >
df.dropna(axis=1, inplace=True)

# write that dataframe to excel file
with pd.ExcelWriter(
    excel_file_name,            # file to write to
    engine='openpyxl',          # which engine to use
    mode='a',                   # append mode (required for if_sheet_exists to work)
    if_sheet_exists='replace'   # if that sheet exists, replace it
) as writer:
    df.to_excel(writer, sheet_name=excel_sheet_name)
Removing duplicates for a row with duplicates in one column dynamic data
I am attempting to remove duplicates in Column D for dynamic data with no headers or identifying features. I want to delete all rows where there are duplicates in Column D. I am converting Excel to a DataFrame, removing duplicates, and then writing it back to Excel. However, I keep getting an assortment of errors, or no duplicates are removed. I am from a VBA background, but we are migrating to Python. I was able to do this in VBA with one line. Any ideas why this keeps causing issues?

Attempted:

df.drop_duplicates(["C"])
df.groupby(["C"]).filter(lambda df: df.shape[0] == 1)

As well as an assortment of other variations:

df.drop_duplicates(subset=['C'], keep=False)
DG = df.groupby(['C'])
print(pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value) == 1]))

Template:

import pandas as pd
df = pd.read_excel("C:/wadwa.xlsx", sheetname=0)
columns_to_drop = ['d.1']
#columns_to_drop = ['d.1', 'b.1', 'e.1', 'f.1', 'g.1']
df = df[[col for col in df.columns if col not in columns_to_drop]]
print(df)
writer = pd.ExcelWriter('C:/dadwa/dwad.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()

Code:

import pandas as pd
df = pd.read_excel("C:/Users/Documents/Book1.xlsx", sheetname=0)
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
writer = pd.ExcelWriter('C:/Users//Documents/Book2.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
print(df)
I think you need to assign the result back and select the 4th column by position: df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
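To see it end to end, here is a sketch on a tiny made-up frame (no headers, so the fourth column is addressed by position, as in the question):

```python
import pandas as pd

# Headerless stand-in data; the fourth column (index 3) plays the
# role of the question's "Column D".
df = pd.DataFrame([[1, 2, 3, 'a'],
                   [4, 5, 6, 'a'],
                   [7, 8, 9, 'b']])

# keep=False drops *every* row whose fourth-column value occurs more
# than once; the result must be assigned back, since drop_duplicates
# returns a new frame rather than modifying df in place.
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
print(df)
```

Only the row whose fourth-column value is unique survives, which matches the one-line VBA behavior described in the question.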
Python: Manipulate dataframe in user function
I want to manipulate a DataFrame in a user-written function in Python. The manipulation code works fine when I run it outside the function. However, when I put it in the function and run the function, it runs without an error but does not return any DataFrame. My code looks like the following:

def reshape(file):
    from IPython import get_ipython
    get_ipython().magic('reset -sf')

    #import packages
    import pandas as pd
    import datetime
    import calendar

    #define file path and import files
    path = "X:/TEMP/"
    file_path = path + file
    df = pd.read_excel(file_path, "Sheet1", parse_dates=["Date"])

    #reshape data to panel
    melted = pd.melt(df, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)

    id_to_string = pd.read_excel(file_path, "Sheet2")
    id_to_string = id_to_string.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)

    merged = pd.merge(melted, id_to_string, how="left", on="id")
    merged = merged.sort(["Date", "Market_Cap"], ascending=[1, 0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)

    df = pd.read_excel(file_path, "hardcopy_return", parse_dates=["Date"])
    df = df.sort("Date", ascending=1)
    old = merged
    merged = pd.merge(old, df, on=["Date", "id"])
    merged = merged.set_index("Date")
    return merged

reshape("sample.xlsx")

This code runs through but returns nothing. Did I make a mistake in the def command or by calling the function? Any help is greatly appreciated.
I assume this is run with IPython or a Jupyter notebook? It might have worked before because the kernel remembers some state. Before making something into a separate function instead of a straight script, I do a restart kernel & run all. On the code itself, I would split the different parts, so it becomes easier to test the individual pieces.

Imports:

import pandas as pd
import datetime
import calendar
from IPython import get_ipython
get_ipython().magic('reset -sf')

Read 'Sheet1': get the data from the first worksheet and do the first processing.

def read_melted(file_path):
    df1 = pd.read_excel(file_path, sheetname='Sheet1', parse_dates=["Date"])
    melted = pd.melt(df1, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)
    return melted

Read 'Sheet2':

def read_id_to_string(file_path):
    df2 = pd.read_excel(file_path, sheetname='Sheet2')
    id_to_string = df2.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)
    return id_to_string

Read 'hardcopy_return':

def read_hardcopy_return(file_path):
    df = pd.read_excel(file_path, sheetname='hardcopy_return', parse_dates=["Date"])
    return df.sort("Date", ascending=1)

Tying it together:

def reshape(df1, df2, df_hardcopy_return):
    merged = pd.merge(df1, df2, how="left", on="id").sort(["Date", "Market_Cap"], ascending=[1, 0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)  # what does this line do?
    merged_all = pd.merge(merged, df_hardcopy_return, on=["Date", "id"]).set_index("Date")
    return merged_all

Calling everything:

path = "X:/TEMP/"
file_path = path + file
df1 = read_melted(file_path)
df2 = read_id_to_string(file_path)
df_hardcopy_return = read_hardcopy_return(file_path)
reshape(df1, df2, df_hardcopy_return)

The only thing that still strikes me as odd is the line merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True).

read_excel sheetname: pandas.read_excel also has a sheetname argument, which you can use to only open the Excel file once. Reading Excel files can be slow sometimes, so this might make it faster too.
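One more thing worth checking: outside a notebook, the bare call reshape("sample.xlsx") silently discards the returned frame, so "returns nothing" is the expected behavior unless you assign the result to a name and print it. A stripped-down sketch with toy data (only the melt step, not the question's full pipeline) to illustrate:

```python
import pandas as pd

# Minimal stand-in for the question's reshape(): take a wide frame
# and melt it to a Date-indexed panel, then return it.
def reshape(df):
    melted = pd.melt(df, id_vars='Date', var_name='id', value_name='Market_Cap')
    return melted.set_index('Date')

df = pd.DataFrame({'Date': ['2020-01-01'], 'id1': [10], 'id2': [20]})

# Keep the return value instead of dropping it; in a plain script,
# nothing is displayed unless you print it yourself.
merged = reshape(df)
print(merged)
```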
pandas Combine Excel Spreadsheets
I have an Excel workbook with many tabs. Each tab has the same set of headers as all the others. I want to combine all of the data from each tab into one DataFrame (without repeating the headers for each tab). So far, I've tried:

import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()

Can I use something for the parse argument that will mean "all spreadsheets"? Or is this the wrong approach? Thanks in advance!

Update: I tried:

a = xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b

But it's not "working".
This is one way to do it: load all sheets into a dictionary of DataFrames and then concatenate all the values in the dictionary into one DataFrame.

import pandas as pd

# Set sheet_name to None in order to load all sheets into a dict of
# dataframes, and drop the index to avoid overlapping values later
# (see comment by @bunji)
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)

# Then concatenate all dataframes
cdf = pd.concat(df.values())

print(cdf)
import pandas as pd

f = 'file.xlsx'
# read every sheet into a dict of DataFrames
df = pd.read_excel(f, sheet_name=None)
# ignore_index belongs to concat, not read_excel
df2 = pd.concat(df.values(), ignore_index=True)
df2.to_excel('merged.xlsx', engine='xlsxwriter', sheet_name='Merged', header=True, index=False)
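The dict-of-frames step can be demonstrated without touching a file, which also shows why ignore_index matters. The sheet names and values below are placeholders standing in for what read_excel(..., sheet_name=None) would return:

```python
import pandas as pd

# Stand-in for what pd.read_excel(f, sheet_name=None) returns:
# a dict mapping sheet names to DataFrames with identical headers.
sheets = {
    'Sheet1': pd.DataFrame({'A': [1], 'B': [2]}),
    'Sheet2': pd.DataFrame({'A': [3], 'B': [4]}),
}

# ignore_index renumbers the rows so each sheet's 0-based index
# does not collide in the combined frame.
merged = pd.concat(sheets.values(), ignore_index=True)
print(merged)
```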