I have an Excel sheet:
     31-12-2019   31-01-2020   28-02-2020
1    -0,36%       0,12%        -0,09%
2    -0,18%       0,06%        -0,07%
3    0,05%        0,04%        0,14%
*(The headers are displayed in Excel as 31-Dec-19, 31-Jan-20, etc.; not sure if relevant.)*
To be clear, the problem is not reading the file itself, but the issue described below.
I want to read this file with pandas in Python and have the dates in the header as strings, so that later I can refer to any column with something like df['31-12-2019'].
When I read the Excel file now, I get a KeyError, because the format of the dates in the header is changed. I currently read it like this:
curve = pd.read_excel("Monthly curves.xlsx", sheet_name = "swap", skiprows = 1, index_col = 0)
I receive the error when selecting, for instance, column 31-12-2019: KeyError: '31-12-2019'. Any help would be much appreciated!
Also, the first column does not have a header; how can I name it 'years' myself?
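For what it's worth, a minimal sketch of one way to get there, assuming the header dates are being parsed into pandas Timestamp objects (which is what usually causes the KeyError):

import pandas as pd

curve = pd.read_excel("Monthly curves.xlsx", sheet_name="swap",
                      skiprows=1, index_col=0)

# Render the parsed Timestamp headers back as 'dd-mm-yyyy' strings
curve.columns = [col.strftime('%d-%m-%Y') for col in curve.columns]

# Name the (previously unnamed) index column
curve.index.name = 'years'

print(curve['31-12-2019'])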
It worked when I used this:
import pandas as pnd
file = 'excelfile.xlsx'
df = pnd.read_excel(file,sheet_name=0,index_col=0)
df.head()
I don't know about naming the headers though...
I worked around my problem by reading the file as follows:
curve = pd.read_excel("Monthly Curves.xlsx", sheet_name = "swap", index_col = 0, skiprows = 2, header = None)
Then, to select for instance the 91st column, I used .loc (because .ix is deprecated), and I did that in the following way:
M12 = curve.loc[:, 91]
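One detail worth noting about this workaround: with header=None the columns get integer labels starting at 0, and index_col=0 consumes label 0, so label-based and position-based selection are off by one (a small sketch):

M12 = curve.loc[:, 91]    # the column whose integer label is 91
M12 = curve.iloc[:, 90]   # the same column, selected by position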
Hope that helps others as well!
I'm trying to export my df to a .csv file. The df has just two columns of data: the image name (.jpg) and the value_counts of how many times that .jpg name occurs in the 'concat_Xenos.csv' file, i.e.:
M116_13331848_13109013329679.jpg 19
M116_13331848_13109013316679.jpg 14
M116_13331848_13109013350679.jpg 12
M116_13331848_13109013332679.jpg 11
etc. etc. etc....
However, whenever I export the df, the .csv file only displays the value_counts column. How do I fix this?
My code is as follows:
concat_Xenos = r'C:\file_path\concat_Xenos.csv'
df = pd.read_csv(concat_Xenos, header=None, index_col=False)[0]
counts = df.value_counts()
export_csv = counts.to_csv(r'C:\file_path\concat_Xenos_valuecounts.csv', index=None, header=False)
Thanks! If any clarification is needed please ask :)
This is because the first column (the image names) is set as the index, and it is being excluded from the export.
Use index=True:
export_csv = counts.to_csv(r'C:\file_path\concat_Xenos_valuecounts.csv', index=True, header=False)
or you can reset the index before exporting (note that value_counts returns a Series, so reset_index produces a new DataFrame rather than working in place):
counts = counts.reset_index()
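A slightly fuller sketch that also names the two columns on export (the names 'image' and 'count' are illustrative):

counts = df.value_counts().rename_axis('image').reset_index(name='count')
counts.to_csv(r'C:\file_path\concat_Xenos_valuecounts.csv', index=False)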
I would like to convert an Excel file to a pandas DataFrame. All the sheet names have spaces in them, for instance 'part 1 of 22', 'part 2 of 22', and so on. In addition, the first column is the same in all the sheets.
I would like to convert this Excel file to a single DataFrame. However, I don't know what happens with the names in Python. I was able to import the sheets, but I do not know the names of the resulting DataFrames.
The sheets are imported, but I do not know their names. After this I would like to use another 'for' loop with pd.merge() to create a single DataFrame.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    sheet = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(sheet)
Or, even better, using a list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
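If you also want to remember which sheet each row came from, pd.concat accepts keys; a sketch building on the list above:

final_df = pd.concat(all_my_sheets,
                     keys=Matrix.sheet_names,  # adds a sheet-name level to the index
                     sort=False)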
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames

# Assuming your sheets have the same headers and footers
records = []
n = 1
for sheet_name in all_my_sheets:
    ws = wb[sheet_name]
    for row in ws.iter_rows(min_col=1,
                            min_row=n,
                            max_col=ws.max_column,
                            max_row=ws.max_row):
        records.append([cell.value for cell in row])
    # Make sure you don't duplicate the header on the following sheets
    n = 2

# ------------------------------
# Set the column names
header = records.pop(0)

# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once, passing all the sheet names, and receive the contents as a dict of DataFrames.
So, the first step would look like this (filename stands for your Excel file path):
dfs = pd.read_excel(filename, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the Excel file, and that read_excel returns a dict of DataFrames keyed by sheet name when given a list. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs.values(), axis=0)
Note that this solution results in a final_df that includes column headers from all three sheets, so ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
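As an aside, passing sheet_name=None loads every sheet in one call, without listing their names, as a dict keyed by sheet name (a sketch; 'data.xlsx' stands in for your file):

import pandas as pd

dfs = pd.read_excel('data.xlsx', sheet_name=None)  # {sheet name: DataFrame}
final_df = pd.concat(dfs.values(), ignore_index=True)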
I hope this helps!
I'm trying to read a csv file with pandas.
This file actually has only one row, but it causes an error whenever I try to read it.
Something seems to go wrong around line 8, but I can hardly find an 8th line, since there is clearly only one row in the file.
I read it like this:
import codecs
import pandas as pd

with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the csv with Notepad and noticed that the separator was a TAB, not a comma, and then tried the combination below:
df = pd.read_csv('C:\\myfile.csv', sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False)
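Note that in pandas 1.3+ error_bad_lines is deprecated; the equivalent call would be:

df = pd.read_csv(file, header=None, on_bad_lines='skip')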
The existing answer will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest point, you can use the following:
delimiter = ','
with open(path_name, 'r') as f:
    # find the widest row: count its delimiters and add 1 to get the field count
    max_columns = max(line.count(delimiter) for line in f) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 if there's actually a header; you can always retrieve the header column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
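A sketch of that identification step, reusing the delimiter from above (reported line numbers are 1-based, counting the header as line 1):

with open(path_name, 'r') as f:
    header_fields = f.readline().count(delimiter) + 1
    wide_rows = [i for i, line in enumerate(f, start=2)
                 if line.count(delimiter) + 1 > header_fields]
print(wide_rows)  # rows with more populated columns than the header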
I have a lot of different tables (and other unstructured data) in an Excel sheet. I need to create a DataFrame out of range 'A3:D20' from 'Sheet2' of the Excel file 'data'.
All examples that I come across drill down to the sheet level, but not how to pick an exact range.
import openpyxl
import pandas as pd
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.get_sheet_by_name('Sheet2')
range = ['A3':'D20'] #<-- how to specify this?
spots = pd.DataFrame(sheet.range) #what should be the exact syntax for this?
print (spots)
Once I get this, I plan to look up data in column A and find its corresponding value in column B.
Edit 1: I realised that openpyxl takes too long, so I changed that to pandas.read_excel('data.xlsx', 'Sheet2') instead, and it is much faster, at that stage at least.
Edit 2: For the time being, I have put my data in just one sheet and:
- removed all other info
- added column names
- applied index_col on my leftmost column
- then used wb.loc[]
Use the following arguments from pandas read_excel documentation:
skiprows : list-like
    Rows to skip at the beginning (0-indexed)
nrows : int, default None
    Number of rows to parse.
parse_cols : int or list, default None
    If None, then parse all columns.
    If int, then indicates last column to be parsed.
    If list of ints, then indicates list of column numbers to be parsed.
    If string, then indicates comma separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F").
I imagine the call will look like:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, parse_cols='A:D')
EDIT:
In later versions of pandas, parse_cols has been renamed to usecols, so the above call should be rewritten as:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, usecols='A:D')
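From there, looking up a value in column A and returning its corresponding value from column B could look like this (a sketch; 'A', 'B' and lookup_key stand in for your real header names and key):

df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, usecols='A:D')
value = df.loc[df['A'] == lookup_key, 'B'].iloc[0]  # first match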
One way to do this is to use the openpyxl module.
Here's an example:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename='data.xlsx', read_only=True)
ws = wb['Sheet2']

# Read the cell values into a list of lists
data_rows = []
for row in ws['A3':'D20']:
    data_cols = []
    for cell in row:
        data_cols.append(cell.value)
    data_rows.append(data_cols)

# Transform into dataframe
df = pd.DataFrame(data_rows)
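If the first row of the range (row 3 of the sheet) holds your column names, you can promote it when building the DataFrame (a sketch):

df = pd.DataFrame(data_rows[1:], columns=data_rows[0])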
My answer, tested with pandas 0.25, worked well:
pd.read_excel('resultat-elections-2012.xls', sheet_name='France entière T1T2', skiprows=2, nrows=5, usecols='A:H')
pd.read_excel('resultat-elections-2012.xls', index_col=None, skiprows=2, nrows=5, sheet_name='France entière T1T2', usecols=range(0, 8))
So:
I need the data after the first two lines; I selected the desired rows (5) and columns A to H.
Be careful: @shane's answer needs to be updated with the new pandas parameters.
I'm trying to read large data (thousands of rows) through a Python script from CSV files which look like this:
.....
2015-11-03 20:16:28,000;63,62;
2015-11-03 20:16:29,000;63,75;
2015-11-03 20:16:30,000;63,86;
2015-11-03 20:16:31,000;64,25;
but it appears that one of the files has extra empty rows containing 196541465 blank spaces, and then the code crashes when reading it with pandas' read_csv.
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 4221, in append
elif isinstance(other, list) and not isinstance(other[0], DataFrame):
IndexError: list index out of range
I'm using the following command:
data = pd.read_csv(input_file, skiprows=[0], usecols=[0,1,2], delimiter=';', decimal=',', names=['date','angle','Unnamed'], na_filter=False, parse_dates=[0], date_parser=reformat_date, error_bad_lines=False, skip_blank_lines=True)  # ,nrows = 8191
The culprit row is the 8192nd; when limiting the rows (with nrows=8191) it works just fine. I've tried many options from the docs, but nothing seems to work! Any ideas?
I got this error because I was trying to read a CSV file that had too few headers compared to the number of columns (e.g. 10 columns, but only 8 headers). If you set index_col=False, pandas doesn't know what to do with the extra columns.
Edited according to Mitja's comment below.
I just had the same issue and index_col=False didn't work. I had 19 columns and only 17 headers. I solved it by reading the columns and headers separately and then adding the header names:
# Read just the header row to learn the named columns
dfcolumns = pd.read_csv('file.csv', nrows=1)

# Read the data without a header, keeping only as many columns as there are names
df = pd.read_csv('file.csv',
                 header=None,
                 skiprows=1,
                 usecols=list(range(len(dfcolumns.columns))),
                 names=dfcolumns.columns)