Getting rows of data frame excluding header in pandas - python

I have a CSV file, a screenshot of which is shown below:
My goal is to get the rows of the data frame excluding the header part.
My purpose: I'm converting the row data into "float64" using astype() and then rounding it to two decimal places using pandas' round() function. The code for this is shown below:
df = pd.read_csv('C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV',skiprows = 1)
df = df.astype('float64', errors="ignore")
df = df.round(decimals=2)
Here, as you can see, I'm skipping the first row in order to exclude the header.
But unfortunately, the data is not being rounded to 2 decimal places. The results are shown below:
I am not sure, but I guess the "Empty DataFrame" line is causing the problem with rounding the data.
With header=None:
Even header=1 behaves the same way skiprows=1 did.
Any suggestions are most welcome...
Thank you

I believe you need:
import pandas as pd
import numpy as np

file = 'C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV'

# for default columns 0,1,2,...N, omitting the first row of the original data
df = pd.read_csv(file, skiprows=1, header=None)
# for column names taken from the first row of the file, omit the skiprows and header parameters
#df = pd.read_csv(file)

# if necessary, convert to floats (columns that cannot be converted are left as-is)
df = df.astype('float64', errors="ignore")

# select only numeric columns
cols = df.select_dtypes(np.number).columns
# round only the numeric columns
df[cols] = df[cols].round(decimals=2)
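As an optional check (just a sketch), confirm which columns actually became numeric, since round() silently leaves object columns untouched:
print(df.dtypes)
print(df.head())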

Try the following:
df = pd.read_csv('C:/Users/viral/PycharmProjects/Samples/036_20191009T132130.CSV',skiprows = 1)
df = df.astype('float64', errors="ignore")
df = df.round(decimals=2)

Related

How to prevent Data loss in Pandas.to_excel when handling very long string of numbers

This is my input file (CSV):
id1,id2
233924749247492472,9284372492472497294749
298347230474308444,9472943274947429427477
I want to read this file into a dataframe, remove the delimiter, and then write it back to an .xlsx file.
A few code combinations I have already tried:
Attempt 1:
df2 = pd.read_csv(path, sep=delimiter, float_precision=None)
pd.options.display.float_format = '{:.1f}'.format
df2.to_excel(filepath, index=False)
Attempt 2:
df2 = pd.read_csv(path, sep=delimiter)
writer = pd.ExcelWriter(path, engine=None)
df2.to_excel(writer, index=False)
Attempt 3:
df2 = pd.read_csv(path, sep=delimiter)
df2.to_excel(path, index=False)
Every time I get the same output in the Excel file.
I am seeing data loss in the first column. The output looks like this:
id1                   id2
233924749247493000    9284372492472497294749
298347230474309000    9472943274947429427477
By default, pandas will cast integers as int64. This is enough for integers between -2⁶³ and 2⁶³-1 = 9223372036854775807. So if any element in a column exceeds this value, pandas will set the column type to object.
Apparently, Excel truncates big ints (even those smaller than 2⁶³-1) but not objects. So a solution would be to set the dtypes of all your columns to object:
pd.read_csv('input.csv', dtype=object).to_excel('output.xlsx')
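A quick way to see which column is at risk before writing, as a sketch against the two-row sample above:
import pandas as pd

df = pd.read_csv('input.csv')
print(df.dtypes)
# id1     int64   <- fits in int64, so Excel later truncates it to 15 significant digits
# id2    object   <- exceeds int64, so pandas keeps it as an object (Python int) column
# dtype: object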

how to convert column values to str when reading multi-sheet xlsx using pd.read_excel?

I have a multi-sheet xlsx file; I want to process selected sheets and finally save them as CSV.
This is a snapshot of a few rows from one sheet:
I use this code to load all sheets and process them one by one:
def load_raw_excel_file(file_full_name):
    df = pd.read_excel(file_full_name, sheet_name=None, engine="openpyxl", header=0)
    sheets_name = list(df.keys())
    return df, sheets_name
The output of the code (from the same sheet) looks like this:
dfs, shs = load_raw_excel_file("myexelfile.xlsx")
dfs['myselectedsheetname']
As you can see, some values from the Contract column have changed to date, but I don't want any changes.
I've tried using converters and dtype in pd.read_excel, but it didn't work:
df = pd.read_excel(file_full_name, sheet_name=None, engine="openpyxl", header=0, dtype=str)
or
df = pd.read_excel("myexelfile.xlsx", sheet_name='selectedsheetname', header=0, converters={'Contract':str})
Any ideas?
Update
I found a workaround but not a good solution:
def convert_str_date(x):
    try:
        y = x.strftime("%b-%y")
        return y
    except:
        return x
df.Contract.apply(lambda x : convert_str_date(x))
Also, see @Simon's answer.
Excel stores those values in a datetime format; maybe you can post-process the dataframe:
nKCol = df['Contract']
oKCol = df['Contract'].copy()
# update cell to %b-%y string format; NaN if error
nKCol = pd.to_datetime(nKCol, errors='coerce').dt.strftime('%b-%y')
# update the column
df['Contract'] = nKCol
# fill NaN with original column
df['Contract'] = df['Contract'].fillna(oKCol)
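If the end goal is to process the selected sheets and save them as CSV, here is a sketch combining the question's loader with this post-processing (the per-sheet output file names are my assumption):
import pandas as pd

dfs, shs = load_raw_excel_file("myexelfile.xlsx")
for name, sheet_df in dfs.items():
    if 'Contract' in sheet_df.columns:
        # same fix as above: reformat real datetimes, keep everything else as-is
        fixed = pd.to_datetime(sheet_df['Contract'], errors='coerce').dt.strftime('%b-%y')
        sheet_df['Contract'] = fixed.fillna(sheet_df['Contract'])
    sheet_df.to_csv(f"{name}.csv", index=False)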

Allow duplicate columns in Pandas

I'm splitting a large CSV file (containing stock financial data) into smaller chunks. The format of the CSV file is unusual, something like an Excel pivot table: the first few rows of the first column contain some headers.
Company name, id, etc. are repeated across the following columns, because a single company has more than one attribute rather than just one column.
After the first few rows, the columns then start resembling a typical data frame where headers are in columns instead of rows.
Anyway, what I'm trying to do is make pandas allow duplicate column headers and stop it from appending ".1", ".2", ".3", etc. to them. I know pandas does not allow this natively; is there a workaround? I tried setting header=None on read_csv, but it throws a tokenization error, which I think makes sense. I just can't think of an easy way.
import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))

filename = 1
# column increment
x = 30 * 59

for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
        # This should be the same as df, but with only the first column.
        # Check it with similar code to above.
EDIT:
From https://github.com/pandas-dev/pandas/issues/19383, I add:
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
So, full code:
import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))

filename = 1
# column increment
x = 30 * 59

for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
        # This should be the same as df, but with only the first column.
        # Check it with similar code to above.
Now, the entire first row is gone. But the expected output is for the header row to be replaced with the reset index, without the ".1", ".2", etc.
Screenshot:
The SimFin ID row is no longer there.
This is how I did it:
final_df.columns = final_df.columns.str.split('.').str[0]
Reference:
https://pandas.pydata.org/pandas-docs/stable/text.html
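As a quick illustration of that one-liner (a sketch with hypothetical column names):
import pandas as pd
from io import StringIO

csv = "A;A;B\n1;2;3\n"
final_df = pd.read_csv(StringIO(csv), sep=';')
print(list(final_df.columns))            # ['A', 'A.1', 'B'] - pandas mangled the duplicate
final_df.columns = final_df.columns.str.split('.').str[0]
print(list(final_df.columns))            # ['A', 'A', 'B']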
The solution below ensures that other column names containing a period ('.') in the dataframe do not get modified:
import pandas as pd
from csv import DictReader
csv_file_loc = "file.csv"
# Read csv
df = pd.read_csv(csv_file_loc)
# Get column names from csv file using DictReader
col_names = DictReader(open(csv_file_loc, 'r')).fieldnames
# Rename columns
df.columns = col_names
I know I'm pretty late to the draw on this one, but I'm leaving the solution I came up with in case anyone else wanders across this as I have.
Firstly, the linked question has a pretty nice and dynamic solution that seems to work well even for high column counts. I came across that after I made my solution, haha. Check it out here. Another answer on this thread uses the csv library to read and reuse the column names from the file, as it doesn't seem to modify duplicates like Pandas does. That should work fine, but I just wanted to avoid using any extra libraries, especially considering I was originally using csv and then upgraded to Pandas for better functionality.
Now here's my solution. I'm sure it could be done more nicely but this does the job for what I needed and is pretty dynamic, from what I can tell. It basically goes through the columns, checks if it can split the string based on the rightmost "." (that's the rpartition), then does a few more checks from there.
It checks:
Is this string in the colMap? The colMap keeps track of all of the column names, duplicate or not. If this comes back true, then that means it's a duplicate of another column that came before it.
Is the string after the rightmost "." a number? All of the columns are strings, so this just makes sure that whatever it is can be converted into a number to prevent grabbing some other random column that meets previous criteria but isn't actually a dupe from Pandas. eg. "DupeCol" and "DupeCol.Stuff" wouldn't get picked up, but "DupeCol" and "DupeCol.1" would.
Does the number that comes after the rightmost "." match up to the current count of duplicates in the colMap? Seeing as the colMap contains all of the names of the columns, duplicates or not, this will ensure that we're not grabbing a user-named column that managed to overlap with the ".number" convention that Pandas uses. Eg. if a user had named two columns "DupeCol" and "DupeCol.6", it wouldn't get picked up unless there were 6 "DupeCol"s preceding "DupeCol.6", indicating that it almost had to be Pandas that named it that way, as opposed to the user. This part is definitely a bit overkill, but I felt like being extra thorough.
colMap = []

for col in df.columns:
    if col.rpartition('.')[0]:
        colName = col.rpartition('.')[0]
        inMap = col.rpartition('.')[0] in colMap
        lastIsNum = col.rpartition('.')[-1].isdigit()
        dupeCount = colMap.count(colName)
        if inMap and lastIsNum and (int(col.rpartition('.')[-1]) == dupeCount):
            colMap.append(colName)
            continue
    colMap.append(col)

df.columns = colMap
Hopefully this helps someone! Feel free to comment if you think it could use any improvements. I don't entirely love using "continue" in my code, but I'm not sure if that's because it's actually bad practice or just me reading random people complain about it too much. I think it doesn't make the code too unreadable here and prevents the need for duplicating the "else" statement; but let me know if there's a way to improve that or anything otherwise. I'm always looking to learn!
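As a quick sanity check of the loop (hypothetical column names), including the edge case described in the third check above:
import pandas as pd
from io import StringIO

# "Price" is a real duplicate; "Total.6" is a user-named column that merely looks mangled
sample = "Date;Price;Price;Total.6\n2020-01-01;1;2;3\n"
df = pd.read_csv(StringIO(sample), sep=';')
print(list(df.columns))   # ['Date', 'Price', 'Price.1', 'Total.6']
# ... run the colMap loop above ...
# df.columns is now ['Date', 'Price', 'Price', 'Total.6'] - only the real dupe is renamed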
If you know the types of all your data, you may consider loading the CSV without a header first.
df = pd.read_csv(csv_file, header=None)
df.columns = df.iloc[0] # replace column with first row
df = df.drop(0) # remove the first row
(Note that drop removes the row by label; this assumes your index is unique, which may not be true if you use the index_col argument of pd.read_csv.)
Caveat: the above solution causes you to lose dtype information.
There is a way to fix that:
# turn each column into numeric
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'), axis=0)
Otherwise, you may consider reading the CSV twice to get the dtype information and apply the correct conversion.
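A minimal sketch of that read-twice idea (hypothetical file name; the first pass is a throwaway read used only for its inferred dtypes):
import pandas as pd

# pass 1: a normal read so pandas can infer dtypes (duplicate names get mangled here, which is fine)
inferred = pd.read_csv('data.csv', nrows=1000)

# pass 2: keep the original, possibly duplicated, header untouched
names = pd.read_csv('data.csv', header=None, nrows=1).iloc[0].tolist()
raw = pd.read_csv('data.csv', header=None, skiprows=1)

# apply the inferred dtypes positionally, then restore the duplicate names
df = pd.DataFrame({i: raw.iloc[:, i].astype(dt) for i, dt in enumerate(inferred.dtypes)})
df.columns = names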

Python Pandas dataframe reading exact specified range in an excel sheet

I have a lot of different tables (and other unstructured data) in an Excel sheet. I need to create a dataframe out of the range 'A3:D20' of 'Sheet2' in the Excel file 'data'.
All the examples that I come across drill down only to the sheet level, not how to pick an exact range.
import openpyxl
import pandas as pd
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.get_sheet_by_name('Sheet2')
range = ['A3':'D20'] #<-- how to specify this?
spots = pd.DataFrame(sheet.range) #what should be the exact syntax for this?
print (spots)
Once I get this, I plan to look up data in column A and find its corresponding value in column B.
Edit 1: I realised that openpyxl takes too long, so I have changed that to pandas.read_excel('data.xlsx', 'Sheet2') instead, and it is much faster, at that stage at least.
Edit 2: For the time being, I have put my data in just one sheet and:
removed all other info
added column names,
applied index_col on my leftmost column
then used wb.loc[]
Use the following arguments from pandas read_excel documentation:
skiprows : list-like
Rows to skip at the beginning (0-indexed)
nrows: int, default None
Number of rows to parse.
parse_cols : int or list, default None
If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)
I imagine the call will look like:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, parse_cols='A:D')
EDIT:
in later versions of pandas, parse_cols has been renamed to usecols, so the above call should be rewritten as:
df = pd.read_excel(filename, 'Sheet2', skiprows=2, nrows=18, usecols='A:D')
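For the lookup mentioned at the end of the question, a hypothetical follow-up (the column names 'A' and 'B' here stand for whatever headers the first row of the range actually contains):
# find the value in column 'B' for a given key in column 'A'
key = 'some_label'
value = df.loc[df['A'] == key, 'B'].iloc[0]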
One way to do this is to use the openpyxl module.
Here's an example:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename='data.xlsx', read_only=True)
ws = wb['Sheet2']

# Read the cell values into a list of lists
data_rows = []
for row in ws['A3':'D20']:
    data_cols = []
    for cell in row:
        data_cols.append(cell.value)
    data_rows.append(data_cols)

# Transform into a dataframe
df = pd.DataFrame(data_rows)
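If the first row of the range holds the column names (an assumption about the sheet layout), an optional follow-up:
# use the first row of the range as the header and drop it from the data
df.columns = df.iloc[0]
df = df.drop(0).reset_index(drop=True)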
My answer, tested with pandas 0.25, worked well:
pd.read_excel('resultat-elections-2012.xls', sheet_name = 'France entière T1T2', skiprows = 2, nrows= 5, usecols = 'A:H')
pd.read_excel('resultat-elections-2012.xls', index_col = None, skiprows= 2, nrows= 5, sheet_name='France entière T1T2', usecols=range(0,8))
So:
I need the data after the first two lines; I selected the desired rows (5) and columns A to H.
Be careful: @shane's answer needs to be improved and updated with the new pandas parameters.

Pandas - read_table read selected lines

I work with text files that contain some basic information in the first 6 rows, including empty rows. I have to import, process, and export the data into another CSV. Here is an example of the first 6 rows:
Foov7.9 - bar.raw created at 10:45:25 on 10.02.2015:
(empty row)
(empty row)
A B C D
a b c d
(empty row)
In pandas I use row 4:
A B C D
as header for the dataframe:
data1 = pd.read_table(dataset1,header = 1, skiprows = (4,5), index_col=None, delimiter=r"\t", engine='python')
When writing to_csv after processing the data, I would now like to put back the first 6 rows, but I already fail when reading them. By solely writing the header from row 4 into the CSV, I would lose all the additional information.
How can I read these rows and later put them back into the csv without interfering with the dataframe header?
There is most likely a neater way to do it, but this works and only reads your data once, for performance:
(1) Read data
in_df = pd.read_excel("test.xls", header=0)
(2) create a header for later
header = in_df[:5] #only first rows
(3) save the header columns for concat later
cols = list(header.columns.values) #a list with headers
(4) create a copy for data processing
data = in_df
data.rename(columns=in_df.iloc[2,:], inplace=True) # rename your columns
data = data[5:] # you want just the data body
data = data.reset_index(drop = True) # reindex
#DO WHATEVER WITH DATA
(5) output: concat [header & data]. write output
data.columns = cols # we need the old col names for concat
out_df = pd.concat([header,data]) # do the concat
out_df = out_df.reset_index(drop = True) # reset index (if you want to)
out_df.to_csv("out.csv") #write it. out_df.to_csv("out.csv", index = False) if you don't want index in output
