Python - Trying to extract files from one location to another

I am trying to pull a set of files from a server and store them in a folder on my local machine. The code below works well for this task. However, if any of the files are empty, it stops at that point and does not continue further.
list_ = []
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, delim_whitespace=True)
        list_.append(df)
        temp = pd.concat(list_)
    except EmptyDataError:
        df = pd.DataFrame()
        return df
Could anyone advise how I could bypass these empty files and continue extracting the other files from the server? Thanks
Update:
Given below is the operation I am trying to perform:
list_ = []
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, header=None, delim_whitespace=True)
        list_.append(df)
        temp = pd.concat(list_)
    except pd.errors.EmptyDataError:
        continue

df_v1 = [pd.read_csv(fp, delim_whitespace=True).assign(FileName=os.path.basename(fp)) for fp in allFiles]  # <<-- error thrown on this line, per the traceback
df = pd.concat(df_v1, ignore_index=True, sort=False)
Traceback:
Traceback (most recent call last):
  File "/Users/PycharmProjects/venv/try.py", line 102, in <module>
    s3_func("stores","store_a", "2018-10-03", "2018-10-05")
  File "/Users/PycharmProjects/venv/try.py", line 86, in s3_func
    df_v1 = [pd.read_csv(fp, delim_whitespace=True).assign(FileName=os.path.basename(fp)) for fp in allFiles]
  File "/Users/PycharmProjects/venv/try.py", line 86, in <listcomp>
    df_v1 = [pd.read_csv(fp, delim_whitespace=True).assign(FileName=os.path.basename(fp)) for fp in allFiles]
  File "/Users/PycharmProjects/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/PycharmProjects/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/PycharmProjects/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/Users/PycharmProjects/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/PycharmProjects/venv/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Your loop exits as soon as it reaches the return statement.
If you want the iteration to continue when an exception occurs, you can do the following:
list_ = []
for file_ in allFiles:
    try:
        df = pd.read_csv(file_, index_col=None, delim_whitespace=True)
        list_.append(df)
        temp = pd.concat(list_)
    except EmptyDataError:
        df = pd.DataFrame()
        continue  # changed return to continue, since return breaks out of the loop
Also, I see that you are creating an empty data frame on exception. What do you do with that empty data frame? Do you need it for future use?
If you do need the empty data frames later, consider appending them to the list as well:
    except EmptyDataError:
        df = pd.DataFrame()
        list_.append(df)  # appending empty dataframes to the list
        continue
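Note that the list comprehension from your update will hit the same EmptyDataError, because it re-reads every file outside the try/except. A minimal sketch, assuming allFiles holds the same file paths, that filters out empty files through a small helper:
import os
import pandas as pd
from pandas.errors import EmptyDataError

def read_or_none(fp):
    # read one whitespace-delimited file; return None if the file is empty
    try:
        return pd.read_csv(fp, delim_whitespace=True).assign(FileName=os.path.basename(fp))
    except EmptyDataError:
        return None

frames = [d for d in (read_or_none(fp) for fp in allFiles) if d is not None]
df = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()
This also concatenates once at the end instead of re-running pd.concat on every iteration.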

Related

Iterating over folder names in Python

I have two folders, 1 and 2. I want to go to each folder, which has the file Test.xlsx. I tried to iterate on file_loc using i in range(1,3), but there's an error. The code works if I write 1 or 2 directly in file_loc.
import pandas as pd
import numpy as np

for i in range(1,3):
    file_loc = "C:\\Users\\USER\\OneDrive - Technion\\Research_Technion\\Python_PNM\\Sept12_2022\\i\\Test.xlsx"
    df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
    A = df["N"].to_numpy()
    print([A])
    A = [x for x in A if str(x) != 'nan']
    print(A)
    A = [eval(e) for e in A]
    print(A)
    A = np.array(A)
    print([A])
    A_mean = []
    for i in range(0,len(A)):
        A_mean.append(np.mean(A[i]))
    print(*A_mean, sep='\n')
The error is
Traceback (most recent call last):
  File "C:\Users\USER\OneDrive - Technion\Research_Technion\Python_PNM\Sept12_2022\Test.py", line 12, in <module>
    df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
  File "C:\Users\USER\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\USER\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 364, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "C:\Users\USER\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1191, in __init__
    ext = inspect_excel_format(
  File "C:\Users\USER\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1070, in inspect_excel_format
    with get_handle(
  File "C:\Users\USER\anaconda3\lib\site-packages\pandas\io\common.py", line 711, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\USER\\OneDrive - Technion\\Research_Technion\\Python_PNM\\Sept12_2022\\i\\Test.xlsx'
The i in your path string is a literal character, not the loop variable, so every iteration builds the same non-existent path. Use an f-string so the value of i is substituted:
for i in range(1,3):
    file_loc = f"C:\\Users\\USER\\OneDrive - Technion\\Research_Technion\\Python_PNM\\Sept12_2022\\{i}\\Test.xlsx"
    ...
Also make sure you entered the correct path.
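As a side note, a minimal sketch using pathlib, assuming the same folder layout, avoids the hand-escaped backslashes entirely:
from pathlib import Path
import pandas as pd

base = Path(r"C:\Users\USER\OneDrive - Technion\Research_Technion\Python_PNM\Sept12_2022")
for i in range(1, 3):
    file_loc = base / str(i) / "Test.xlsx"  # folder name taken from the loop variable
    if file_loc.exists():  # guard against a missing folder
        df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")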

string.split() giving memory error in pandas dataframe

I am trying to split a string but getting a memory error. Is there any way to solve this, or an alternative solution?
I am getting the error in the code below:
content_str = str(content_str).split('\n')
df1 = pd.DataFrame(content_str)
df1 = df1[0].str.split(',', expand=True)
Error:
Traceback (most recent call last):
  File "ravi_sir.py", line 47, in <module>
    df1 = df1[0].str.split(',', expand=True)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2001, in wrapper
    return func(self, *args, **kwargs)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2690, in split
    return self._wrap_result(result, expand=expand, returns_string=expand)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2272, in _wrap_result
    result = cons(result, columns=name, index=index, dtype=dtype)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/frame.py", line 520, in __init__
    mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1739, in form_blocks
    object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1784, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1830, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
I am trying to read a zipped file from an S3 bucket and save the content into a dataframe to get the total count inside that zip file. My full code is given below:
import datetime
import gzip
from collections import OrderedDict
from io import BytesIO

import boto3
import pandas as pd

list_table = []
for table in d:  # d is the list of table file names, defined elsewhere
    dict_table = OrderedDict()
    s_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("start_time--->>", s_time)
    print("tablename--->>", table)
    s3 = boto3.resource('s3')
    key = 'raw/vs-1/load-1619/data' + '/' + table
    obj = s3.Object('********', key)
    n = obj.get()['Body'].read()
    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    #print(content)
    content_str = content.decode('utf-8')
    content_str = str(content_str).split('\n')
    df1 = pd.DataFrame(content_str)
    df1 = df1[0].str.split(',', expand=True)
    #df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
    #print(df1)
    #count = os.popen('aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/{0} - | wc -l'.format(table)).read()
    count = int(len(df1)) - 2
    del(df1)
    e_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("End_time---->>", e_time)
    print(count)
    dict_table['Table_Name'] = str(table)
    dict_table['Count'] = count
    list_table.append(dict_table)
Since you are splitting a huge string into a df column and then deleting the df, it looks like you only need the count of commas for each row. So get that count directly, which is simple, rather than splitting the df -- the split could generate a huge number of columns and is therefore the likely cause of your memory error.
row1list = ['1,2,3,4']
row2list = ['5,6']
row3list = ['7,8,9']
df = pd.DataFrame([row1list, row2list, row3list], columns=['col'])
df['count_commas'] = df['col'].str.count(',')
print(df)
#        col  count_commas
# 0  1,2,3,4             3
# 1      5,6             1
# 2    7,8,9             2
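If the goal is only the row count, an even smaller sketch, assuming content_str is still the decoded string (before the split('\n')), skips pandas entirely:
# count rows straight from the decoded text;
# len(s.split('\n')) == s.count('\n') + 1, so this matches the original int(len(df1)) - 2
count = content_str.count('\n') - 1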

How to set many columns on Pandas Python?

I want to insert more than a hundred columns into a CSV file, but it seems the pandas library has a limit on columns.
Here is the error message:
Traceback (most recent call last):
  File "metric.py", line 91, in <module>
    finalFile(sys.argv[1])
  File "metric.py", line 80, in finalFile
    data = pd.read_csv(f, header=None, dtype=str)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
And below is my function:
def finalFile(fname):
    output = pd.DataFrame()
    for file_name in os.listdir('test/sciprt-temp/'):
        if file_name.startswith(fname):
            with open(os.path.join('test/sciprt-temp/', file_name)) as f:
                data = pd.read_csv(f, header=None, dtype=str)
                output[file_name.rsplit('.', 4)[2]] = data[1]
    output.insert(0, 'timestamp', dt.datetime.now().timestamp())
    output.insert(0, 'hostname', fname.rsplit('-', 3)[0])
    output.set_index(output.columns[0], inplace=True)
    output.to_csv(fname.rsplit('.', 2)[2] + ".csv")

finalFile(sys.argv[1])
It seems to work fine when inserting a few columns, but not with more columns.
hostname,timestamp,-diskstats_latency-sda-avgrdwait-g,-diskstats_latency-sda-avgwait-g,-diskstats_latency-sda-avgwrwait-g,-diskstats_latency-sda-svctm-g,-diskstats_latency-sda_avgwait-g
test.test.com,1617779170.62498,2.7979746835e-03,6.6681051841e-03,7.1533659185e-03,2.5977601795e-04,6.6681051841e-03
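For what it's worth, pandas does not cap the number of columns anywhere near a hundred; the EmptyDataError in the traceback means read_csv was handed a file with no content at all. A guard sketch, assuming the culprit is one of the per-metric files in test/sciprt-temp/, skips empty files before parsing:
for file_name in os.listdir('test/sciprt-temp/'):
    if file_name.startswith(fname):
        path = os.path.join('test/sciprt-temp/', file_name)
        if os.path.getsize(path) == 0:
            continue  # an empty file would raise EmptyDataError in read_csv
        data = pd.read_csv(path, header=None, dtype=str)
        output[file_name.rsplit('.', 4)[2]] = data[1]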

FileNotFoundError: [Errno 2] File stock_dfs/BRK.B.csv does not exist: how to adjust the code if the file doesn't exist?

import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests

def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

# save_sp500_tickers()

def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')
    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'yahoo', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))

def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)
    main_df = pd.DataFrame()
    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
The error occurs here. It states "File stock_dfs/BRK.B.csv does not exist", but it wasn't imported/stored locally in the first place, so why is this an issue? Full error at the bottom.
        df.set_index('Date', inplace=True)
        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], 1, inplace=True)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')
        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')

compile_data()
pandas_datareader\compat\__init__.py:7: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  from pandas.util.testing import assert_frame_equal
0
10
20
30
40
50
60
Traceback (most recent call last):
  File "C:\Users\Desktop\Python tutorials\Python finance Examples\py3tutorialSP500manip.py", line 75, in <module>
    compile_data()
  File "C:\Users\Desktop\Python tutorials\Python finance Examples\py3tutorialSP500manip.py", line 57, in compile_data
    df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
  File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Users\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File stock_dfs/BRK.B.csv does not exist: 'stock_dfs/BRK.B.csv'
following this tutorial:
https://pythonprogramming.net/combining-stock-prices-into-one-dataframe-python-programming-for-finance/
Right before the error occurs, a call is made to pd.read_csv():
for count, ticker in enumerate(tickers):
    df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
The error occurs when ticker is 'BRK.B' and the program attempts to read data from 'stock_dfs/BRK.B.csv'.
The error message is saying that there is no stock_dfs/BRK.B.csv file on your machine. This is puzzling, since this bit of code should have downloaded all the necessary files:
for ticker in tickers:
    # just in case your connection breaks, we'd like to save our progress!
    if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
        df = web.DataReader(ticker, 'yahoo', start, end)
        df.reset_index(inplace=True)
        df.set_index("Date", inplace=True)
        df = df.drop("Symbol", axis=1)
        df.to_csv('stock_dfs/{}.csv'.format(ticker))
    else:
        print('Already have {}'.format(ticker))
Make sure you ran the download code (directly above) in the same directory where you are running the read code (top). As a quick check, see if a folder called stock_dfs/ exists in your working directory. That folder should contain files like GOOGL.csv, FB.csv, and specifically BRK.B.csv.
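As a defensive fallback, a sketch, assuming you simply want compile_data() to skip tickers whose CSV never downloaded, guards the read:
for count, ticker in enumerate(tickers):
    path = 'stock_dfs/{}.csv'.format(ticker)
    if not os.path.exists(path):
        print('Missing {}, skipping'.format(ticker))  # e.g. BRK.B if its download failed
        continue
    df = pd.read_csv(path)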

Concat excel files and worksheets into one using python

I have many Excel files in a directory, all of which have the same header row. Some of these Excel files have multiple worksheets, which again have the same headers. I'm trying to loop through the Excel files in the directory, and for each one check if there are multiple worksheets, so I can concat them along with the rest of the Excel files.
This is what I tried:
import pandas as pd
import os
import ntpath
import glob

dir_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(dir_path)
for excel_names in glob.glob('*.xlsx'):
    # read them in
    i = 0
    df = pd.read_excel(excel_names[i], sheet_name=None, ignore_index=True)
    cdf = pd.concat(df.values())
    cdf.to_excel("c.xlsx", header=False, index=False)
    excels = [pd.ExcelFile(name) for name in excel_names]
    # turn them into dataframes
    frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]
    # delete the first row for all frames except the first
    # i.e. remove the header row -- assumes it's the first
    frames[1:] = [df[1:] for df in frames[1:]]
    # concatenate them..
    combined = pd.concat(frames)
    # write it out
    combined.to_excel("c.xlsx", header=False, index=False)
    i += 1
but then I get the below error. Any advice?
"concat excel.py", line 12, in <module>
df = pd.read_excel(excel_names[i], sheet_name=None, ignore_index=True)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "/usr/local/lib/python2.7/site-packages/pandas/io/excel.py", line 653, in __init__
self._reader = self._engines[engine](self._io)
File "/usr/local/lib/python2.7/site-packages/pandas/io/excel.py", line 424, in __init__
self.book = xlrd.open_workbook(filepath_or_buffer)
File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 111, in open_workbook
with open(filename, "rb") as f:
IOError: [Errno 2] No such file or directory: 'G'
Your for statement is setting excel_names to each filename in turn (so a better variable name would be excel_name):
for excel_names in glob.glob('*.xlsx'):
But inside the loop your code does
df = pd.read_excel(excel_names[i], sheet_name=None, ignore_index=True)
where you are clearly expecting excel_names to be a list from which you extract one element. But it isn't a list, it's a single string, so excel_names[i] is just the first character of the filename -- hence the attempt to open a file named 'G'.
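A minimal fixed sketch, assuming every workbook's sheets share the same header row, reads all sheets of every file and concatenates once at the end:
import glob
import pandas as pd

frames = []
for excel_name in glob.glob('*.xlsx'):
    # sheet_name=None loads every worksheet as a dict of {sheet_name: DataFrame}
    sheets = pd.read_excel(excel_name, sheet_name=None)
    frames.extend(sheets.values())

combined = pd.concat(frames, ignore_index=True)
combined.to_excel("c.xlsx", index=False)
Because read_excel parses the shared header row into column names, concat aligns the sheets by name and the header is written out once.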
