I keep getting this error when trying to add an empty column in an imported CSV.
"IndexError: index 27 is out of bounds for axis 0 with size 25"
The original CSV spans columns A-Z (indices 0-25), then AA, AB, AC, AD (26, 27, 28, 29).
(screenshot: original CSV)
The CSV that produces the error currently stretches A-Z, and the error occurs when I try to add the column after that (AA, which I guess would be index 26).
(screenshot: problem CSV)
Here is the code:
```
#import CSV to dataframe
orders = pd.read_csv("Orders.csv", header=None)
#copy columns needed from order to ordersNewCols
ordersNewCols = orders.iloc[:,[1, 3, 11, 12, 15]]
#create new dataframe - ordersToSubmit
ordersToSubmit = pd.DataFrame()
#copy columns from ordersNewCols to ordersToSubmit
ordersToSubmit = ordersNewCols.copy()
ordersToSubmit.to_csv("ordersToSubmit.csv", index=False)
#Insert empty columns where needed.
ordersToSubmit.insert(2,None,'')
ordersToSubmit.insert(3,None,'')
ordersToSubmit.insert(4,None,'')
ordersToSubmit.insert(6,None,'')
ordersToSubmit.insert(7,None,'')
ordersToSubmit.insert(8,None,'')
ordersToSubmit.insert(9,None,'')
ordersToSubmit.insert(10,None,'')
ordersToSubmit.insert(11,None,'')
ordersToSubmit.insert(12,None,'')
ordersToSubmit.insert(13,None,'')
ordersToSubmit.insert(14,None,'')
ordersToSubmit.insert(15,None,'')
ordersToSubmit.insert(16,None,'')
ordersToSubmit.insert(18,None,'')
ordersToSubmit.insert(19,None,'')
ordersToSubmit.insert(20,None,'')
ordersToSubmit.insert(21,None,'')
ordersToSubmit.insert(22,None,'')
ordersToSubmit.insert(23,None,'')
ordersToSubmit.insert(27,None,'')
```

"IndexError: index 27 is out of bounds for axis 0 with size 25"
How do I expand the dataframe so that the insert doesn't raise the error?
(screenshot of the CSV)
Without a look at your CSV file, it is hard to tell exactly what is causing this issue. Anyway...
From the pandas.DataFrame.insert documentation:

    loc : int
        Insertion index. Must verify 0 <= loc <= len(columns).

As you can see, loc must satisfy 0 <= loc <= len(columns), so inserting at position 27 into a frame that only has 25 columns is illegal. If you insert at an index less than len(columns), the remaining columns are shifted one position to the right.
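So the fix is to insert only at positions that exist at the time of each call. Below is a minimal sketch, assuming you just want to pad the frame out to a fixed width: it appends at loc = len(df.columns) each time instead of jumping ahead to 27. Note also that recent pandas versions refuse to insert the same column name (here the literal None) twice unless you pass allow_duplicates=True, so the sketch uses unique placeholder names instead.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})  # stand-in for ordersToSubmit

# Valid: 0 <= loc <= len(df.columns), so len(df.columns) is the last legal slot
df.insert(3, "d", "")      # appends after the existing 3 columns
# df.insert(27, "e", "")   # would raise the IndexError from the question

# To pad out to 28 columns (A..AB), keep appending at the end:
while len(df.columns) < 28:
    df.insert(len(df.columns), f"pad{len(df.columns)}", "")

print(len(df.columns))  # 28
```

Each insert lands at a valid position because the frame grows one column at a time.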
I want to print the dataframe into a PDF, in a table-like structure. Also, I have other data that I want to print on the same page.
I tried to print the dataframe row by row and this is what I tried:
```
from fpdf import FPDF
import pandas as pd

pdf = FPDF(format='letter', unit='in')
pdf.add_page()
pdf.set_font('helvetica', 'BU', 8)
pdf.ln(0.25)

data = [
    [1, 'denumire1', 'cant1', 'pret1', 'valoare1'],
    [2, 'denumire2', 'cant2', 'pret2', 'valoare2'],
    [3, 'denumire3', 'cant3', 'pret3', 'valoare3'],
    [4, 'denumire4', 'cant4', 'pret4', 'valoare4'],
]
df = pd.DataFrame(data, columns=['Nr. crt.', 'Denumire', 'Cant.', 'Pret unitar', 'Valoarea'])

for index, row in df.iterrows():
    pdf.cell(7, 0.5, str(row['Nr. crt.']) + str(row['Denumire']) + str(row['Cant.'])
             + str(row['Pret unitar']) + str(row['Valoarea']))

pdf.output('test.pdf', 'F')
```
However, the format is not readable.
How could I print the dataframe to the pdf using FPDF,so that it aligns in the page?
This is how the dataframe looks now, using the given code:
The fpdf module is a rather low-level library. You have to explicitly write each cell after computing the cell width. Here you use letter size (8.5 x 11 in.) and have 5 columns, so a width of 1.6 in. seems legitimate. The code could be:
```
...
for index, row in df.iterrows():
    for data in row.values:
        pdf.cell(1.6, 0.5, str(data))  # write each value for the row in its cell
    pdf.ln()  # go to the next line after each row
```
IMPORTANT: my solution iterates through the DataFrame. I know this is not ideal, since it's very time-consuming for larger DataFrames, but since you are printing the results in a table I'm assuming it's a small sample. Consider more efficient methods for bigger data sets.
First, let's import the needed modules and create the DataFrame:

```
import pandas as pd
import math
from fpdf import FPDF

data = [
    [1, 'denumire1', 'cant1', 'pret1', 'valoare1'],
    [2, 'denumire2', 'cant2', 'pret2', 'valoare2'],
    [3, 'denumire3', 'cant3', 'pret3', 'valoare3'],
    [4, 'denumire4', 'cant4', 'pret4', 'valoare4'],
]
df = pd.DataFrame(data, columns=['Nr. crt.', 'Denumire', 'Cant.', 'Pret unitar',
                                 'Valoarea'])
```
Now we can create our document, add a page, and set the margins and font:

```
# Creating document
pdf = FPDF("P", "mm", "A4")
pdf.set_margins(left=10, top=10)
pdf.set_font("Helvetica", style="B", size=14)
pdf.set_text_color(r=0, g=0, b=0)
pdf.add_page()
```
Now we can create the first element of our table: the header. I'm assuming we will print only the given columns, so I'll use their names as headers.

Since we have 5 columns with multiple characters, we must take into consideration that we might need more than one line for the header, in case a cell has too many characters for a single line.

To solve that, the line height must equal the font size times the number of lines needed (e.g. if you have a string with a width of 150 and the cell has a width of 100, you will need 2 lines: 1.5 rounded up). We do this for every column name and use the highest value as our number of lines.

Also, I'm assuming you will divide the whole width of the page, minus the margins, equally among the 5 columns (cells).
```
# Creating our table headers
cell_width = (210 - 10 - 10) / len(df.columns)
line_height = pdf.font_size
number_lines = 1
for i in df.columns:
    new_number_lines = math.ceil(pdf.get_string_width(str(i)) / cell_width)
    if new_number_lines > number_lines:
        number_lines = new_number_lines
```
Now, with our line height for the header, we can iterate through the column names and print each one. I'll use style "B" and size 14 for the headers (defined earlier).
```
for i in df.columns:
    pdf.multi_cell(w=cell_width, h=line_height * number_lines * 1.5,
                   txt=str(i), align="C", border="B", new_x="RIGHT", new_y="TOP",
                   max_line_height=line_height)
pdf.ln(line_height * 1.5 * number_lines)
```
After that we must iterate through the whole dataframe and, for each row, create cells with the content. For each row we also have to account for differences in text size and, therefore, the number of lines. By now you have probably figured out that the process is the same as before: we iterate through the row to calculate the number of lines needed, then use that value to define the cells with the content.
Before printing the body of the table, I'm removing the bold style.
```
# Changing font style
pdf.set_font("Helvetica", style="", size=14)

# Creating our table row by row
for index, row in df.iterrows():
    number_lines = 1
    for i in range(len(df.columns)):
        new_number_lines = math.ceil(pdf.get_string_width(str(row[i])) / cell_width)
        if new_number_lines > number_lines:
            number_lines = new_number_lines
    for i in range(len(df.columns)):
        pdf.multi_cell(w=cell_width, h=line_height * number_lines * 1.5,
                       txt=str(row[i]), align="C", border="B", new_x="RIGHT", new_y="TOP",
                       max_line_height=line_height)
    pdf.ln(line_height * 1.5 * number_lines)

pdf.output("table.pdf")
```
I'm trying to filter data stored in a .csv file that contains time and angle values, and save the filtered data to an output .csv file. I solved the filtering part, but the problem is that time is recorded in hh:mm:ss:ms format (e.g. 12:55:34:500) and I want to change it to hhmmss (125534), i.e. remove the colons and the millisecond part.
I tried using the .replace function but I keep getting the KeyError: 'time' error.
Input data:
```
time,angle
12:45:55,56
12:45:56,89
12:45:57,112
12:45:58,189
12:45:59,122
12:46:00,123
```
Code:
```
import pandas as pd

# define min and max angle values
alpha_min = 110
alpha_max = 125

# read input .csv file
data = pd.read_csv('test_csv3.csv', index_col=0)

# filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]

# replace ":" with "" in time values
data['time'] = data['time'].replace(':', '')

# display results
print(data)

# write results
data.to_csv('test_csv3_output.csv')
```
That's because time is your index, so data['time'] raises the KeyError. Read the file without index_col=0:

```
data = pd.read_csv('test_csv3.csv')
```

And change the replace line to:

```
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')
```
Output:

```
     time  angle
2  124557    112
4  124559    122
5  124600    123
```
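As an aside on the .replace attempt from the question: Series.replace matches whole cell values by default, so replacing ":" with "" silently does nothing on strings like "12:45:57"; the substring version is Series.str.replace. A small sketch contrasting the two (the sample values are invented):

```python
import pandas as pd

s = pd.Series(["12:45:57", "12:45:59"])

# Series.replace looks for cells whose entire value equals ":" -- none here:
print(s.replace(":", "").tolist())                    # ['12:45:57', '12:45:59']

# .str.replace substitutes inside each string, which is what was intended:
print(s.str.replace(":", "", regex=False).tolist())   # ['124557', '124559']
```

This is why the original code appeared to "do nothing" even once the KeyError was fixed.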
What would print(data.keys()) or print(data.head()) yield? It seems like you have a stray character before/after the time index string; this happens from time to time, depending on how the csv was created vs. how it was read (see this question).

If it's not a bigger project and/or you just want the data, you could use a simple workaround like timeKeyString = list(data.columns.values)[0] (assuming time is the first column).
I'm not sure what happened, but my code worked earlier today and now it won't. I have an Excel spreadsheet of projects I want to individually import and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
```
import pandas as pd

dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)

cols = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df = df[df.columns[cols]]

tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)

bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)

uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)

vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)

ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)

xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)

yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)

zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
```
"IndexError: index 8 is out of bounds for axis 0 with size 8"
```
cols = [0,1,2,3,4,5,6,7,8]
```

should be

```
cols = [0,1,2,3,4,5,6,7]
```

I think you have 8 columns, but your cols list holds 9 column indices; index 8 does not exist.
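If you'd rather not count columns by hand at all, you can derive the index list from the frame itself. A small sketch (the eight single-letter column names are just stand-ins for the real spreadsheet):

```python
import pandas as pd

# Stand-in for the 8-column spreadsheet from the question
df = pd.DataFrame(columns=list("abcdefgh"))

cols = list(range(len(df.columns)))  # always in range, however many columns exist
df = df[df.columns[cols]]
print(cols)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

This way the selection can never go out of bounds, even if the sheet gains or loses columns.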
An "IndexError: index out of bounds" means you're trying to access or insert something beyond its limit or range.

Whenever you load a file such as test.xls, test.csv or test.xlsx using pandas, e.g.:

```
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
```

it is a good idea to find the number of columns of the DataFrame first; this helps when working with large data sets:

```
import pandas as pd

data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
```

This gives you the exact number of columns of the spreadsheet, so you can specify the indices accordingly:

```
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
```
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double-check the shape of your data frame:

```
import pandas as pd

dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape)  # print shape of data read in to python
```

The output should be

```
(X, 9)  # "X" is the number of rows
```

If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you are getting the error.
Another check is to print out the first few rows of your data frame:

```
print(df.head())
```

This will let you double-check whether you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns while pandas is reading in only 8 of them.
I am working on a problem where I want to convert X and Y pixel values to physical coordinates. I have a huge folder containing many csv files; I load them, pass them to my function, compute the coordinates, overwrite the columns, and return the data frame, which I then write back to the file outside the function. I have the formula that does this correctly, but I am having some problems implementing it in Python.
Each CSV files has many columns. The columns I am interested in are Latitude (degree), Longitude (degree), XPOS and YPOS. The former 2 are blank and the latter 2 have the data with which I need to fill up the former two.
```
import pandas as pd
import glob

max_long = float(XXXX)
max_lat = float(XXXX)
min_long = float(XXXX)
min_lat = float(XXXX)
hoi = int(909)
woi = int(1070)

def pixel2coor(filepath, max_long, max_lat, min_lat, min_long, hoi, woi):
    data = pd.read_csv(filepath)  # reading CSV
    data2 = data.set_index("Log File")  # setting index of dataframe with first column
    data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long)  # computing longitude & overwriting
    data2.loc[data2['Latitude (degree)']] = (((max_lat-min_lat)/woi)*[data2[:,'YPOS']]+min_lat)  # computing latitude & overwriting
    return data2  # return dataframe

filenames = sorted(glob.glob('*.csv'))
for file in filenames:
    df = pixel2coor(file, max_long, max_lat, min_lat, min_long, hoi, woi)  # call pixel2coor on each csv file
    df.to_csv(file)  # overwrite the file with the dataframe
```
I am getting the following error:

```
TypeError: '(slice(None, None, None), 'XPOS')' is an invalid key
```
It looks to me like your syntax is off. In the following line:

```
data2.loc[data2['Longitude (degree)']] = (((max_long-min_long)/hoi)*[data2[:,'XPOS']]+min_long)
```

the left side of your assignment appears to be referring to a column, but you have it in the 'row' section of the .loc slicer. It should be:

```
data2.loc[:, 'Longitude (degree)']
```

On the right side of your equation, you've forgotten .loc (or need to drop the ':,'), so there are two possible solutions:

```
(((max_long-min_long)/hoi)*data2.loc[:,'XPOS']+min_long)
(((max_long-min_long)/hoi)*data2['XPOS']+min_long)
```
Also, I would add that the parentheses on the right side should be more explicit. It's a bit unclear how you want the scalars to act on the series: do you want to add min_long first, or multiply by (max_long-min_long)/hoi first? Your final row might look like this, forcing addition first as an example:

```
data2.loc[:, 'Longitude (degree)'] = ((max_long-min_long)/hoi)*(data2.loc[:,'XPOS']+min_long)
```
This applies to your next line as well. You may get more errors after you fix this.
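Putting those fixes together, here is a sketch of a corrected function. The sample frame, bounds, and coordinate values are invented for illustration, and I've kept the question's pairing of hoi with XPOS and woi with YPOS (worth double-checking, since hoi looks like a height and woi a width):

```python
import pandas as pd

def pixel2coor(df, max_long, max_lat, min_lat, min_long, hoi, woi):
    df = df.set_index("Log File")
    # Vectorized: scale each pixel column into its coordinate range
    df["Longitude (degree)"] = ((max_long - min_long) / hoi) * df["XPOS"] + min_long
    df["Latitude (degree)"] = ((max_lat - min_lat) / woi) * df["YPOS"] + min_lat
    return df

# Invented sample data: two corner pixels
df = pd.DataFrame({
    "Log File": ["a", "b"],
    "Longitude (degree)": [None, None],
    "Latitude (degree)": [None, None],
    "XPOS": [0, 909],
    "YPOS": [0, 1070],
})
out = pixel2coor(df, max_long=10.0, max_lat=50.0, min_lat=40.0, min_long=0.0,
                 hoi=909, woi=1070)
print(out["Longitude (degree)"].round(6).tolist())  # [0.0, 10.0]
print(out["Latitude (degree)"].round(6).tolist())   # [40.0, 50.0]
```

The corner pixels map to the corner coordinates, which is a quick sanity check for the linear scaling.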
I am a beginner in Python and I'm getting an error while trying to drop values from a column in a pandas dataframe. I keep getting a KeyError after some time. Here is the code snippet:

```
for i in data['FilePath'].keys():
    if '.' not in data['FilePath'][i]:
        value = data['FilePath'][i]
        data = data[data['FilePath'] != value]
```

I keep getting the KeyError near the line if '.' not in data['FilePath'][i]. Please help me fix this error.
If I understand your logic correctly, then you should be able to do this without a loop. From what I can see, it looks like you want to drop rows whose FilePath column does not contain a . anywhere. If this is correct, then below is one way to do this.
Create sample data using a nested list:

```
d = [
    ['BytesAccessed','FilePath','DateTime'],
    [0, '/lib/x86_64-linux-gnu/libtinfo.so.5 832.0', '[28/Jun/2018:11:53:09]'],
    [1, './lib/x86-linux-gnu/yourtext.so.6 932.0', '[28/Jun/2018:11:53:09]'],
    [2, '/lib/x86_64-linux-gnu/mytes0', '[28/Jun/2018:11:53:09]'],
]

data = pd.DataFrame(d[1:], columns=d[0])
print(data)
```
```
   BytesAccessed                                   FilePath                DateTime
0              0  /lib/x86_64-linux-gnu/libtinfo.so.5 832.0  [28/Jun/2018:11:53:09]
1              1    ./lib/x86-linux-gnu/yourtext.so.6 932.0  [28/Jun/2018:11:53:09]
2              2               /lib/x86_64-linux-gnu/mytes0  [28/Jun/2018:11:53:09]
```
Filter the data to drop rows that do not contain a . at any location in the FilePath column:

```
data_filtered = (data.set_index('FilePath')
                     .filter(like='.', axis=0)
                     .reset_index())[data.columns]

print(data_filtered)
```
```
   BytesAccessed                                   FilePath                DateTime
0              0  /lib/x86_64-linux-gnu/libtinfo.so.5 832.0  [28/Jun/2018:11:53:09]
1              1    ./lib/x86-linux-gnu/yourtext.so.6 932.0  [28/Jun/2018:11:53:09]
```
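A simpler alternative, if you don't mind a boolean mask instead of the set_index/filter round-trip: Series.str.contains with regex=False checks for a literal "." in each value. A sketch on two invented sample rows:

```python
import pandas as pd

data = pd.DataFrame({
    "BytesAccessed": [0, 2],
    "FilePath": ["/lib/x86_64-linux-gnu/libtinfo.so.5 832.0",
                 "/lib/x86_64-linux-gnu/mytes0"],
})

# Keep only rows whose FilePath contains a literal "."
kept = data[data["FilePath"].str.contains(".", regex=False)]
print(len(kept))  # 1
```

This avoids touching the index at all, and regex=False prevents "." from being interpreted as a regex wildcard.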