Pandas dataframe sorting issue - python

I have an Excel spreadsheet that I create a data frame from. When I run my code, I can't get the dataframe to sort correctly on port_count. I'm trying to sort on port_count and then display the ports that are open for each IP address. This code is almost there, but the sorting is giving me a problem.
import pandas as pd
import openpyxl as xl
data = {'IP': ['192.168.1.1','192.168.1.1','192.168.1.1','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','10.10.10.10','192.168.5.3','192.168.5.3','192.168.4.6','192.168.4.6','192.168.4.7','192.168.4.7','192.168.8.9','192.168.8.9','10.10.2.3','10.10.2.3','10.5.2.3','10.5.2.3','10.1.2.3','10.1.2.3','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','4.5.6.7','4.5.6.7','4.5.6.7','4.5.6.7','4.5.6.7','192.168.9.10','192.168.9.10','192.168.9.11','192.168.9.11','192.168.9.12','192.168.9.12','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','10.1.5.6','8.9.10.11','8.9.10.11','8.9.10.11','2.8.3.9','2.8.3.9','12.13.14.15','13.14.15.16','13.14.15.16','74.208.236.41','74.208.236.41','74.208.236.41','3.234.139.2','3.234.139.2','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','172.67.173.229','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','192.168.60.23','1.2.3.6','192.168.9.6','192.168.9.6','172.16.54.65','172.16.54.65','172.16.54.65','172.16.54.65','172.16.54.66','172.16.54.66','172.16.54.66','172.16.85.36','172.16.85.36','10.10.12.12','10.10.12.12'],
'Port': ['22','80','443','80','443','2082','2083','2086','2087','2095','8080','8443','8880','80','443','80','443','80','443','80','443','80','443','80','443','80','443','21','22','25','80','110','143','443','465','587','993','995','2082','2086','2087','2096','3306','25','80','443','465','587','80','443','80','443','80','443','80','443','2052','2053','2082','2083','2086','2087','2096','8080','8443','8880','5222','8008','8443','80','443','80','80','443','80','81','443','80','443','80','443','2082','2083','2087','8443','8880','80','443','2052','2053','2082','2083','2086','2087','2096','8080','8443','8880','80','80','443','80','82','83','443','80','82','443','80','443','80','443'],
}
df = pd.DataFrame(data)
df['port_count'] = df.groupby('IP')['Port'].transform('count')
df['port_count'] = df['port_count'].astype(int)
df.sort_values(by=['port_count'], ascending=False, inplace=True)
pivot1 = df.pivot_table(df, index=['IP', 'Port'], columns=None, fill_value=0).sort_values(by='port_count', ascending=False)
if df.size != 0:
    with pd.ExcelWriter("/testing/test.xlsx", mode="a", engine="openpyxl", if_sheet_exists='replace') as writer:
        pivot1.to_excel(writer, sheet_name="IP to Port")
Current output looks like this:
https://www.hopticalillusion.co/shared-files/730/?test_output.xlsx
Desired Output:
https://www.hopticalillusion.co/shared-files/731/?desired_test_output.xlsx

Maybe try the following:
df['port_count'] = df['port_count'].astype(int)
df.sort_values(by=['port_count'], ascending=False, inplace=True)
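
One possible reason the order still looks wrong: sorting on port_count alone does not guarantee any particular order among IPs that share the same count, and the ports are strings, so they compare as text rather than as numbers. Below is a minimal sketch, assuming the desired layout is "IP with the most ports first, each IP's rows kept together, ports in numeric order". The shortened data and the helper column port_num are made up for illustration.

```python
import pandas as pd

# Shortened, made-up subset of the question's data
data = {
    "IP": ["192.168.1.1"] * 3 + ["10.10.10.10"] * 4 + ["192.168.5.3"] * 2 + ["192.168.4.6"] * 2,
    "Port": ["22", "80", "443", "80", "443", "2082", "8080", "80", "443", "80", "443"],
}
df = pd.DataFrame(data)

# Number of rows (ports) per IP, repeated on every row of that IP
df["port_count"] = df.groupby("IP")["Port"].transform("count")

# Hypothetical helper column so ports sort as numbers, not as text
df["port_num"] = df["Port"].astype(int)

# Sort by count (descending), then keep each IP together, then by numeric port
result = (
    df.sort_values(["port_count", "IP", "port_num"], ascending=[False, True, True])
      .drop(columns="port_num")
      .reset_index(drop=True)
)
print(result)
```

Writing result.set_index(["IP", "Port"]) with to_excel should then keep this row order, since set_index does not re-sort the rows.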

Related

Pandas dataframe throwing error when appending to CSV

import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df[sector_select] = ["100"]
df.to_csv("stack.csv", index=False, mode='a', header=False)
stack.csv has no data other than a header: Col1,Col2,Col3,Col4,Col5
ValueError: Length of values (1) does not match length of index (2)
I'm just trying to make a program where I can select a header and append data to the column under that header.
It only runs twice before it raises the error!
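You can use this (DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):
df = pd.concat([df, pd.DataFrame([{"Col2": 100}])], ignore_index=True)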
That code runs for me.
But I assume that you would like to run something like this:
import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df.at[len(df), sector_select] = "100"
df.to_csv("stack.csv", index=False)
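
As far as I can tell, the error appears on the third run because each run re-reads every row already in the file, overwrites the column with a one-element list, and appends the whole frame again; once the file holds two rows, a list of length 1 no longer matches the index. Here is a minimal sketch that appends only the new row instead. The helper name append_value and the temporary file are assumptions made for the example, not part of the question.

```python
import os
import tempfile

import pandas as pd

# Stand-in for stack.csv: a file that contains only the header row
path = os.path.join(tempfile.mkdtemp(), "stack.csv")
with open(path, "w") as f:
    f.write("Col1,Col2,Col3,Col4,Col5\n")


def append_value(csv_path, column, value):
    """Append one row that has `value` under `column` and blanks elsewhere."""
    # Read only the header so the new row uses the file's column order
    columns = pd.read_csv(csv_path, nrows=0).columns
    row = pd.DataFrame([{column: value}], columns=columns)
    # Append just this row; the existing rows are never rewritten
    row.to_csv(csv_path, mode="a", header=False, index=False)


# Running it more than twice no longer fails
for _ in range(3):
    append_value(path, "Col2", "100")

result = pd.read_csv(path)
print(result)
```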

AttributeError: 'ExcelFile' object has no attribute 'dropna'

I was trying to remove the empty column in an Excel file with the pandas dropna() method, but I ended up with the above error message. Please find my code below:
import pandas as pd
df = pd.ExcelFile("1.xlsx")
print(df.sheet_names)
#df.dropna(how='all', axis=1)
newdf = df.dropna()
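pd.ExcelFile gives you a handle on the workbook, not a DataFrame, so it has no dropna method; a sheet has to be read into a DataFrame first. Please provide more code and context, but this might help: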
import pandas as pd
excel_file_name = 'insert excel file path'
excel_sheet_name = 'insert sheet name'
# create dataframe from desired excel file
df = pd.read_excel(
    excel_file_name,
    engine='openpyxl',
    sheet_name=excel_sheet_name
)
# drop columns that contain any NaN value and write that into df
# (pass how='all' to drop only columns that are entirely empty)
# without the inplace option it would have to be
# < df = df.dropna(axis=1) >
df.dropna(axis=1, inplace=True)
# write that dataframe to excel file
with pd.ExcelWriter(
    excel_file_name,  # file to write to
    engine='openpyxl',  # which engine to use
    mode='a',  # use mode append (has to be used for if_sheet_exists to work)
    if_sheet_exists='replace'  # if that sheet exists, replace it
) as writer:
    df.to_excel(writer, sheet_name=excel_sheet_name)
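
Since the question is about empty columns, the difference between the default dropna(axis=1) and dropna(axis=1, how='all') may matter. A small sketch on a made-up frame (the column names a, b, c are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: "b" is completely empty, "c" has a single gap
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

# Default how='any': a column is dropped if it has at least one NaN
any_dropped = df.dropna(axis=1)

# how='all': a column is dropped only if every value in it is NaN
all_dropped = df.dropna(axis=1, how="all")

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```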

Removing duplicates for a row with duplicates in one column dynamic data

I am attempting to remove duplicates in Column D for dynamic data with no headers or identifying features. I want to delete all the rows where Column D has duplicates. I am converting Excel to a dataframe, removing the duplicates, and then putting it back into Excel. However, I keep getting an assortment of errors, or no duplicates are removed. I am from a VBA background, but we are migrating to Python.
Attempted:
df.drop_duplicates(["C"])
df = pd.DataFrame({"C"})
df.groupby(["C"]).filter(lambda df:df.shape[0] == 1)
As well as an assortment of other variations. I was able to do this in VBA with one line. Any ideas why this keeps causing this issue?
import pandas as pd
df = pd.DataFrame({"C"})
df.drop_duplicates(subset=['C'], keep=False)
DG=df.groupby(['C'])
print(pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1]))
Template code:
import pandas as pd
df = pd.read_excel("C:/wadwa.xlsx", sheet_name=0)
columns_to_drop = ['d.1']
#columns_to_drop = ['d.1', 'b.1', 'e.1', 'f.1', 'g.1']
# assign back to df (not Df) so the filtered frame is the one that gets written
df = df[[col for col in df.columns if col not in columns_to_drop]]
print(df)
# the context manager saves the file on exit; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('C:/dadwa/dwad.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
print(df)
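Code:
import pandas as pd
df = pd.read_excel("C:/Users/Documents/Book1.xlsx", sheet_name=0)
# keep=False removes every row whose value in the fourth column occurs more than once
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
# the context manager saves the file on exit; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('C:/Users//Documents/Book2.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
print(df)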
I think you need to assign the result back and select the fourth column by position:
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
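
A small sketch of what that line does, on made-up header-less data (when a sheet is read with header=None, pandas labels the columns 0, 1, 2, ..., so df.columns[3] is Column D). The values below are invented for the example:

```python
import pandas as pd

# Made-up rows without headers; the fourth column plays the role of Column D
df = pd.DataFrame([
    ["r1", 1, "x", "A"],
    ["r2", 2, "y", "B"],
    ["r3", 3, "z", "A"],
    ["r4", 4, "w", "C"],
])

col_d = df.columns[3]  # fourth column, selected by position

# keep=False: drop every row whose Column D value appears more than once
no_dupes_at_all = df.drop_duplicates(subset=[col_d], keep=False)

# keep="first": keep the first occurrence and drop only the later repeats
first_kept = df.drop_duplicates(subset=[col_d], keep="first")

print(no_dupes_at_all)
print(first_kept)
```

Note that drop_duplicates returns a new dataframe, which is why the result has to be assigned back.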

Python: Manipulate dataframe in user function

I want to manipulate a dataframe in a user-written function in Python. The manipulating code works fine when I run it outside the function. However, when I put it in the function and run the function, it runs without an error but does not return any dataframe. My code looks like the following:
def reshape(file):
    from IPython import get_ipython
    get_ipython().magic('reset -sf')
    #import packages
    import pandas as pd
    import datetime
    import calendar
    #define file path and import files
    path="X:/TEMP/"
    file_path =path+file
    df = pd.read_excel(file_path, "Sheet1", parse_dates=["Date"])
    #reshape data to panel
    melted = pd.melt(df,id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)
    id_to_string = pd.read_excel(file_path, "Sheet2")
    id_to_string = id_to_string.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns = {0: 'id'}, inplace=True)
    id_to_string.rename(columns = {"index": 'Ticker'}, inplace=True)
    merged = pd.merge(melted, id_to_string, how="left", on="id")
    merged = merged.sort(["Date","Market_Cap"], ascending=[1,0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)
    df = pd.read_excel(file_path, "hardcopy_return", parse_dates=["Date"])
    df = df.sort("Date", ascending=1)
    old = merged
    merged = pd.merge(old,df, on=["Date", "id"])
    merged = merged.set_index("Date")
    return merged
reshape("sample.xlsx")
This code runs through but returns nothing. Did I make a mistake in the def command or in calling the function? Any help is greatly appreciated.
I assume this is run with IPython or a Jupyter notebook?
It might have worked before because the kernel remembers some state. Before turning a straight script into a separate function, I restart the kernel and run all cells.
On the code itself, I would split the different parts of the code, so it becomes easier to test the individual parts
Imports
import pandas as pd
import datetime
import calendar
from IPython import get_ipython
get_ipython().magic('reset -sf')
Read 'Sheet1'
Get the data from the first worksheet and do the first processing
def read_melted(file_path):
    df1 = pd.read_excel(file_path, sheet_name='Sheet1', parse_dates=["Date"])
    melted = pd.melt(df1, id_vars="Date", var_name="id", value_name="Market_Cap")
    # strip the "id" prefix so the column can be cast to int (as in the question's code)
    melted["id"] = melted["id"].str.replace("id", "")
    melted["id"] = melted["id"].astype(int)
    melted.reset_index(inplace=True, drop=True)
    return melted
Read 'Sheet2'
def read_id_to_spring(file_path):
    df2 = pd.read_excel(file_path, sheet_name='Sheet2')
    id_to_string = df2.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns = {0: 'id'}, inplace=True)
    id_to_string.rename(columns = {"index": 'Ticker'}, inplace=True)
    return id_to_string
Read 'hardcopy_return'
def read_hardcopy_return(file_path):
    df = pd.read_excel(file_path, sheet_name='hardcopy_return', parse_dates=["Date"])
    # DataFrame.sort no longer exists; sort_values is its replacement
    return df.sort_values("Date", ascending=True)
Tying it together
def reshape(df1, df2, df_hardcopy_return):
    # DataFrame.sort no longer exists; sort_values is its replacement
    merged = pd.merge(df1, df2, how="left", on="id").sort_values(["Date","Market_Cap"], ascending=[True, False])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True) # what does this line do?
    merged_all = pd.merge(merged,df_hardcopy_return, on=["Date", "id"]).set_index("Date")
    return merged_all
Calling everything
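path="X:/TEMP/"
file = "sample.xlsx"  # file name taken from the call in the question
file_path =path+file
df1 = read_melted(file_path)
df2 = read_id_to_spring(file_path)
df_hardcopy_return = read_hardcopy_return(file_path)
# assign the return value; a bare call discards it when run as a script
result = reshape(df1, df2, df_hardcopy_return)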
The only thing that still strikes me as odd is the line merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)
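
On that Rank line: as far as I can tell, groupby(...).rank(ascending=True) numbers the rows within each Date starting from the smallest Market_Cap (rank 1), so the largest company gets the highest number. A small sketch with made-up numbers; add_rank and the sample frame are hypothetical and not from the question. It also assigns the returned dataframe to a name, which is what keeps the result when a function is called from a script.

```python
import pandas as pd


def add_rank(panel):
    """Sort by date and descending market cap, then rank within each date."""
    out = panel.sort_values(["Date", "Market_Cap"], ascending=[True, False])
    # rank(ascending=True): smallest Market_Cap within a Date gets rank 1
    out["Rank"] = out.groupby("Date")["Market_Cap"].rank(ascending=True)
    return out


# Made-up panel: three ids observed on two dates
panel = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-31"] * 3 + ["2020-02-29"] * 3),
    "id": [1, 2, 3, 1, 2, 3],
    "Market_Cap": [10.0, 30.0, 20.0, 15.0, 5.0, 25.0],
})

# Without the assignment the returned dataframe would simply be thrown away
ranked = add_rank(panel)
print(ranked)
```

If the biggest company should get rank 1 instead, ascending=False would be the option to try.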
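read_excel sheet_name
pandas.read_excel has a sheet_name argument (called sheetname in older versions). Passing a list of sheet names, or None for all sheets, returns a dict of dataframes, so the Excel file only has to be opened once. Reading Excel files can be slow sometimes, so this might make it faster too.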

pandas Combine Excel Spreadsheets

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can I use something for the parse argument that will mean "all spreadsheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a=xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b
But it's not "working".
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd
# Set sheet_name to None in order to load all sheets into a dict of dataframes
# and ignore index to avoid overlapping values later (see comment by @bunji)
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
# Then concatenate all dataframes
cdf = pd.concat(df.values())
print(cdf)
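Alternatively, to write the merged result back to Excel:
import pandas as pd
f = 'file.xlsx'
# read_excel has no ignore_index argument; index=False below drops the index on write
df = pd.read_excel(f, sheet_name=None)
df2 = pd.concat(df, sort=True)
df2.to_excel('merged.xlsx',
             engine='xlsxwriter',
             sheet_name='Merged',
             header=True,
             index=False)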
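
To see the concatenation step without an Excel file at hand, here is a minimal sketch in which a hand-built dict stands in for what pd.read_excel(..., sheet_name=None) returns (a dict mapping sheet names to dataframes). The sheet names and values are made up.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({"name": ["a", "b"], "value": [1, 2]}),
    "Sheet2": pd.DataFrame({"name": ["c"], "value": [3]}),
}

# Stack all sheets; ignore_index=True gives one clean 0..n-1 index
combined = pd.concat(sheets.values(), ignore_index=True)

# Variant that remembers which sheet each row came from:
# the dict keys become an index level, which is then turned into a column
with_source = (
    pd.concat(sheets, names=["sheet", "row"])
      .reset_index(level="sheet")
      .reset_index(drop=True)
)

print(combined)
print(with_source)
```

The headers appear only once in the result because concat aligns the frames on their column names.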
