Python: Pandas DataFrame to CSV

I want to create a CSV file from the constructed DataFrame so I do not have to use the internet to access the information. The data comes from the lists in the code: 'CIK', 'Ticker', 'Company', 'Sector', 'Industry'.
My current code is as follows:
def stockStat():
    doc = pq('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    for heading in doc(".mw-headline:contains('S&P 500 Component Stocks')").parent("h2"):
        rows = pq(heading).next("table tr")
        cik = []
        ticker = []
        coName = []
        sector = []
        industry = []
        for row in rows:
            tds = pq(row).find("td")
            cik.append(tds.eq(7).text())
            ticker.append(tds.eq(0).text())
            coName.append(tds.eq(1).text())
            sector.append(tds.eq(3).text())
            industry.append(tds.eq(4).text())
        d = {'CIK': cik, 'Ticker': ticker, 'Company': coName, 'Sector': sector, 'Industry': industry}
        stockData = pd.DataFrame(d)
        stockData = stockData.set_index('Ticker')

stockStat()

As EdChum already mentioned in the comments, creating a CSV out of a pandas DataFrame is done with the DataFrame.to_csv() method.
The DataFrame.to_csv() method takes many arguments; they are all covered in the DataFrame.to_csv() documentation. Here is a small example:
import pandas as pd
df = pd.DataFrame({'mycolumn': [1,2,3,4]})
df.to_csv('~/myfile.csv')
After this, myfile.csv should be available in your home directory.
If you are using Windows, saving the file to 'C:\myfile.csv' may work better as a proof of concept.
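To tie this back to the question, stockStat() could return stockData and the result be written out the same way. A minimal round-trip sketch (the two tickers below are made-up sample rows, not the scraped table):

```python
import pandas as pd

# hypothetical slice of the frame built in stockStat()
stockData = pd.DataFrame({
    'Ticker': ['MMM', 'AOS'],
    'Company': ['3M', 'A. O. Smith'],
    'Sector': ['Industrials', 'Industrials'],
}).set_index('Ticker')

# write the frame to disk once...
stockData.to_csv('sp500.csv')

# ...then later reload it without touching the internet
roundtrip = pd.read_csv('sp500.csv', index_col='Ticker')
```

The index column is written to the file as well, so passing index_col='Ticker' on the way back restores the original frame.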

Related

Automatically transposing Excel user data in a Pandas DataFrame

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way I would tackle it (and I am assuming there are likely more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. That list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # drop the closing bracket
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
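A self-contained sketch of the same pattern, runnable without an Excel file (the names, dates and amounts are invented for illustration):

```python
import pandas as pd

# hypothetical rows as they might look after read_excel + dropna:
# a numbered "id, Name (Position)" row, followed by that person's entries
raw = [(1, 'Alice Smith (Manager)'),
       ('05/01/2021', 100.0),
       ('05/02/2021', 250.0),
       (2, 'Bob Jones (Clerk)'),
       ('07/01/2021', 75.0)]

counter = 1
list_of_lines = []
for col1, col2 in raw:
    if col1 == counter:
        # header row: remember the person's details for the rows that follow
        name, pos = col2.split(' (')
        pos = pos.rstrip(')')
        p_index = counter
        counter += 1
    else:
        # data row: attach it to the most recent person
        list_of_lines.append({'p_index': p_index, 'name': name,
                              'position': pos, 'date': col1, 'amount': col2})

final_df = pd.DataFrame(list_of_lines)
```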

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to Excel, but when I run the print function it prints all of them. Is there an issue with my code causing it not to export all the data to Excel?
I've also tried exporting as a .csv file with no luck.
import pandas as pd

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)

for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index=False)
        #print(df)
Your problem is that each time df.to_excel is called you overwrite the file, so only the last df is left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)

counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name=f"sheet_{counter}", index=False)

writer.save()
You might need pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work this way, since a csv holds a single data table (like a single sheet in Excel), so you would need a new csv for each df.
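A sketch of that one-file-per-table CSV route, with hypothetical tables standing in for the read_html results:

```python
import pandas as pd

# hypothetical tables standing in for pd.read_html(url) results
dfs = [pd.DataFrame({'team': ['A', 'B'], 'score': [21, 14]}),
       pd.DataFrame({'team': ['C', 'D'], 'score': [7, 28]})]

for i, df in enumerate(dfs):
    df.to_csv(f'table_{i}.csv', index=False)  # one file per table
```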
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]

columns = ["gameid", "game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
values = np.empty((2 * N, len(columns)), dtype=object)  # np.object is deprecated; use the builtin

for i, df in enumerate(dfs):
    time = df.iloc[0, 0].replace(" Game Time", "")
    values[2*i:2*i+2, 2:] = df.iloc[2:, :]
    values[2*i:2*i+2, :2] = np.array([[i, time], [i, time]])

newdf = pd.DataFrame(values, columns=columns)
newdf.to_excel("output.xlsx", index=False)
I used a numpy array of object dtype to be able to copy a submatrix from the original dataframes easily into its intended place. I also needed to create a gameid that connects the rows of the same game. It should now be trivial to rewrite this to loop through a list of urls and write each to a separate sheet.
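The same stacking idea can be sketched on synthetic tables with pd.concat, a swapped-in alternative to the object-array approach (all the data here is invented):

```python
import pandas as pd

# hypothetical per-game tables, two rows (teams) each
dfs = [pd.DataFrame({'team': ['A', 'B'], 'spread': ['-3', '+3']}),
       pd.DataFrame({'team': ['C', 'D'], 'spread': ['-7', '+7']})]

parts = []
for i, df in enumerate(dfs):
    part = df.copy()
    part.insert(0, 'gameid', i)  # connect the two rows of each game
    parts.append(part)

merged = pd.concat(parts, ignore_index=True)
```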

Python - How to create a pandas Dataframe directly from Smartsheets?

I don't understand how to import a Smartsheet sheet and convert it to a pandas DataFrame. I want to manipulate the data from Smartsheet; currently I export to CSV from Smartsheet and import the CSV in Python, but I want to eliminate this step so that it can run on a schedule.
import smartsheet
import pandas as pd
access_token ='#################'
smartsheet = Smartsheet(access_token)
sheet = smartsheet.sheets.get('Sheet 1')
pd.DataFrame(sheet)
Here is a simple method to convert a sheet to a dataframe:
def simple_sheet_to_dataframe(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = []
    for row in sheet.rows:
        cells = []
        for cell in row.cells:
            cells.append(cell.value)
        rows.append(cells)
    data_frame = pd.DataFrame(rows, columns=col_names)
    return data_frame
The only issue with creating a dataframe from smartsheets is that for certain column types cell.value and cell.display_value are different. For example, contact columns will either display the name or the email address depending on which is used.
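Since the function only touches .columns, .rows, .cells, .title and .value, it can be exercised without a Smartsheet connection using stand-in objects (the sheet below is invented for illustration):

```python
import pandas as pd
from types import SimpleNamespace as NS

def simple_sheet_to_dataframe(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = [[cell.value for cell in row.cells] for row in sheet.rows]
    return pd.DataFrame(rows, columns=col_names)

# stub sheet mimicking the SDK's attribute names
sheet = NS(columns=[NS(title='Task'), NS(title='Owner')],
           rows=[NS(cells=[NS(value='Design'), NS(value='Ana')]),
                 NS(cells=[NS(value='Build'), NS(value='Ben')])])

df = simple_sheet_to_dataframe(sheet)
```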
Here is a snippet of what I use when needing to pull in data from Smartsheet into Pandas. Note, I've included garbage collection as I regularly work with dozens of sheets at or near the 200,000 cell limit.
import smartsheet
import pandas as pd
import gc

configs = {'api_key': 0000000,
           'value_cols': ['Assigned User']}

class SmartsheetConnector:
    def __init__(self, configs):
        self._cfg = configs
        self.ss = smartsheet.Smartsheet(self._cfg['api_key'])
        self.ss.errors_as_exceptions(True)

    def get_sheet_as_dataframe(self, sheet_id):
        sheet = self.ss.Sheets.get_sheet(sheet_id)
        col_map = {col.id: col.title for col in sheet.columns}
        # rows = sheet id, row id, cell values or display values
        data_frame = pd.DataFrame(
            [[sheet.id, row.id] +
             [cell.value if col_map[cell.column_id] in self._cfg['value_cols']
              else cell.display_value for cell in row.cells]
             for row in sheet.rows],
            columns=['Sheet ID', 'Row ID'] + [col.title for col in sheet.columns])
        del sheet, col_map
        gc.collect()  # force garbage collection
        return data_frame

    def get_report_as_dataframe(self, report_id):
        rprt = self.ss.Reports.get_report(report_id, page_size=0)
        page_count = int(rprt.total_row_count / 10000) + 1
        col_map = {col.virtual_id: col.title for col in rprt.columns}
        data = []
        for page in range(1, page_count + 1):
            rprt = self.ss.Reports.get_report(report_id, page_size=10000, page=page)
            data += [[row.sheet_id, row.id] +
                     [cell.value if col_map[cell.virtual_column_id] in self._cfg['value_cols']
                      else cell.display_value for cell in row.cells]
                     for row in rprt.rows]
            del rprt
        data_frame = pd.DataFrame(data, columns=['Sheet ID', 'Row ID'] + list(col_map.values()))
        del col_map, page_count, data
        gc.collect()
        return data_frame
This adds additional columns for sheet and row IDs so that I can write back to Smartsheet later if needed.
Sheets cannot be retrieved by name, as you've shown in your example code. It is entirely possible for you to have multiple sheets with the same name. You must retrieve them with their sheetId number.
For example:
sheet = smartsheet_client.Sheets.get_sheet(4583173393803140) # sheet_id
http://smartsheet-platform.github.io/api-docs/#get-sheet
Smartsheet sheets have a lot of properties associated with them. You'll need to go through the rows and columns of your sheet to retrieve the information you're looking for, and construct it in a format your other system can recognize.
The API docs contain a listing of properties and examples. As a minimal example:
for row in sheet.rows:
    for cell in row.cells:
        # Do something with cell.object_value here
Get the sheet as a csv:
(https://smartsheet-platform.github.io/api-docs/?python#get-sheet-as-excel-pdf-csv)
smartsheet_client.Sheets.get_sheet_as_csv(
    1531988831168388,  # sheet_id
    download_directory_path)
Read the csv into a DataFrame:
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
pandas.read_csv
You can use the smartsheet-dataframe library. It is very easy to use and delivers Sheets or Reports as a DataFrame.
pip install smartsheet-dataframe
Get a report as df
from smartsheet_dataframe import get_as_df, get_report_as_df
df = get_report_as_df(token='smartsheet_auth_token',
                      report_id=report_id_int)
Get a sheet as df
from smartsheet_dataframe import get_as_df, get_sheet_as_df
df = get_sheet_as_df(token='smartsheet_auth_token',
                     sheet_id=sheet_id_int)
Replace 'smartsheet_auth_token' with your token (numbers and letters) and sheet_id_int / report_id_int with your sheet or report id (numbers only).

How can compare two excel files for checking the format in python?

I have one Excel sheet in the right format (a certain number of headers with specific names). I have another Excel sheet, and I have to check whether it is in the right format (it must have the same number of headers and the same header names; it does not matter if the values below the headers change). How can I solve this? Is NLP or some other method suitable?
If you have to compare two Excel files you could try something like this (I also add some example Excel files):
def areHeaderExcelEqual(excel1, excel2):
    equals = True
    if len(excel1.columns) != len(excel2.columns):
        return False
    for i in range(len(excel1.columns)):
        if excel1.columns[i] != excel2.columns[i]:
            equals = False
    return equals
And that's an application:
import pandas as pd
#create first example Excel
df_out = pd.DataFrame([('string1',1),('string2',2), ('string3',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp1.xlsx')
#create second example Excel
df_out = pd.DataFrame([('string5',1),('string2',5), ('string2',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp2.xlsx')
# create third example Excel
df_out = pd.DataFrame([('string1',1),('string4',2), ('string3',3)], columns=['MyName', 'MyValue'])
df_out.to_excel('tmp3.xlsx')
excel1 = pd.read_excel('tmp1.xlsx')
excel2 = pd.read_excel('tmp2.xlsx')
excel3 = pd.read_excel('tmp3.xlsx')
print(areHeaderExcelEqual(excel1, excel2))
print(areHeaderExcelEqual(excel1, excel3))
Note: Excel's files are provided just to see the different outputs.
For example, excel1 contains the three Name/Value rows created above; the idea is the same for the other files.
Here's your code:
f1 = pd.read_excel('file1.xlsx')
f2 = pd.read_excel('file2.xlsx')
print(areHeaderExcelEqual(f1, f2))
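Since the check only inspects .columns, it can also be exercised without writing any files. A condensed version of the same comparison on hypothetical in-memory frames:

```python
import pandas as pd

def areHeaderExcelEqual(excel1, excel2):
    # headers are equal iff the counts match and every name matches in order
    if len(excel1.columns) != len(excel2.columns):
        return False
    return all(excel1.columns[i] == excel2.columns[i]
               for i in range(len(excel1.columns)))

a = pd.DataFrame([('x', 1)], columns=['Name', 'Value'])
b = pd.DataFrame([('y', 2)], columns=['Name', 'Value'])
c = pd.DataFrame([('z', 3)], columns=['MyName', 'MyValue'])
```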
You can use pandas for that comparison.
import pandas as pd
f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')
header_threshold = 5 # any number of headers
print(len(f1.columns) == header_threshold)
print(f1.columns) # get the column names as values

How to output my dictionary to an Excel sheet in Python

Background: My first Excel-related script. Using openpyxl.
There is an Excel sheet with loads of different types of data about products in different columns.
My goal is to extract certain types of data from certain columns (e.g. price, barcode, status), assign those to the unique product code and then output product code, price, barcode and status to a new excel doc.
I have succeeded in extracting the data and putting it the following dictionary format:
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
My general thinking on getting this output to a new report is something like this (although I know that this is wrong):
newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet['A1'].value = 'Product Code'
newSheet['B1'].value = 'Price'
newSheet['C1'].value = 'Barcode'
newSheet['D1'].value = 'Status'
for row in range(2, len(productData) + 1):
    newSheet['A' + str(row)].value = productData[productCode]
    newSheet['B' + str(row)].value = productPrice
    newSheet['C' + str(row)].value = productBarcode
    newSheet['D' + str(row)].value = productStatus
newReport.save('ihopethisworks.xlsx')
What do I actually need to do to output the data?
I would suggest using Pandas for that. It has the following syntax:
df = pd.read_excel('your_file.xlsx')
df['Column name you want'].to_excel('new_file.xlsx')
You can do a lot more with it. Openpyxl might not be the right tool for your task (Openpyxl is too general).
P.S. I would have left this in the comments, but Stack Overflow, in their wisdom, decided to let anyone leave answers, but not to comment.
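A sketch of the pandas route applied directly to the question's dictionary (the second product is an invented extra row to show the loop over multiple keys):

```python
import pandas as pd

productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'},
               'CD456': {'barcode': 654321, 'price': 20, 'status': 'OLD'}}

# orient='index' makes each product code a row and each inner key a column
df = pd.DataFrame.from_dict(productData, orient='index')
df.index.name = 'Product Code'

# writing the report is then a one-liner (needs openpyxl as the writer engine):
# df.to_excel('ihopethisworks.xlsx')
```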
The logic you use to extract the data is missing, but I suspect the best approach is to use it to loop over the two worksheets in parallel. You can then avoid using a dictionary entirely and just append rows to the new worksheet as you loop.
Pseudocode:
ws1 # source worksheet
ws2 # new worksheet
product = []
code = ws1[…] # some lookup
barcode = ws1[…]
price = ws1[…]
status = ws1[…]
ws2.append([code, price, barcode, status])
Pandas will work best for this. Here are some examples:
import pandas as pd
#df columns: Date Open High Low Close Volume
#reading data from an excel
df = pd.read_excel('GOOG-NYSE_SPY.xls')
#set index to the column of your choice, in this case it would be date
df.set_index('Date', inplace = True)
#choosing the columns of your choice for further manipulation
df = df[['Open', 'Close']]
#divide two colums to get the % change
df = (df['Open'] - df['Close']) / df['Close'] * 100
print(df.head())
