I'm using Python pandas to read a CSV file, calculate values, and then write the calculated values to a new CSV file.
My CSV files have several columns. I have used sep=';', but then some of the cell values go missing (they all start out in "General" format in Excel, but after creating the new CSV file they are suddenly missing and have switched to "Custom" format). I have also used sep=',', and then no values are missing, but the resulting CSV is not very easy to read, because all of the values end up in the same first column.
Any ideas? Thankful for any help!
There is a picture of what I got when using semicolon as a separator.
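For what it's worth, this looks like the usual European-locale round trip: when the numbers use a comma as the decimal mark, both the separator and the decimal character need to be set explicitly on read and write, otherwise values get mangled. A minimal sketch, with input.csv and output.csv standing in for the real file names (and the decimal-comma cause being an assumption):

import pandas as pd

# read a semicolon-separated file whose numbers use a decimal comma
df = pd.read_csv("input.csv", sep=";", decimal=",")

# ... calculations on df ...

# write it back in the same locale-friendly format
df.to_csv("output.csv", sep=";", decimal=",", index=False)

Excel with a European region setting should then split the semicolon-separated output into columns correctly.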
I have a pandas dataframe abc which I created as follows:
abc = pd.DataFrame({"A":[1,2,3],"B":[2,3,4]})
I added some additional attributes of the dataframe as follows:
abc.attrs = {"Name":"John", "Country":"Nepal"}
I'd like to save the pandas dataframe into an Excel file in xlsx or CSV format. I can do that using abc.to_excel("filename.xlsx") or abc.to_csv("filename.csv") where filename is the required name of the file.
However, I am not able to print the attributes in the saved file. I'd like to save the dataframe in an Excel file such that the first row gives Name and the second row gives Country in two columns, as shown below:
How can I do that?
Unfortunately, .to_excel() and .to_csv() do not provide any explicit functionality to insert meta information ahead of the actual dataframe as documented for the Excel and CSV write functions.
Regardless, one could exploit the header argument to hardcode this preamble into the frame. This can be achieved, for example, with
abc.to_csv("filename.csv", header=[str(k) + ',' + str(v) + '\n' for k,v in abc.attrs.items()])
Please note, however, that data tables store homogeneous data across rows and columns. Adding meta information on top makes the data harder to read and process. Consider putting it (a) in the file name, (b) in a distinct table, or (c) dropping it altogether.
Additionally, it should be noted that as of now (pandas 1.4.3), the attributes feature is experimental and could change or disappear in any future version, which makes any implementation relying on it brittle.
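As an alternative to the header trick, here is a plain-file sketch (not part of the original approach) that writes one metadata row per attribute and then appends the frame below it:

import pandas as pd

abc = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
abc.attrs = {"Name": "John", "Country": "Nepal"}

with open("filename.csv", "w", newline="") as f:
    # one metadata row per attribute, ahead of the actual table
    for k, v in abc.attrs.items():
        f.write(f"{k},{v}\n")
    # to_csv accepts an open file handle, so the frame lands underneath
    abc.to_csv(f, index=False)

This yields Name and Country rows first, followed by the regular header and data rows.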
I have a script that loops through a folder of CSVs, reads them, removes any empty rows (they all have 'empty' rows that Pandas reads as NaN) and appends them to a master dataframe. It then writes the dataframe to a new CSV. This is all working as expected:
import os
import pathlib as pl
import pandas as pd

masterDF = pd.DataFrame()
for file in os.listdir(sourceLoc):  # sourceLoc is defined earlier in the script
    if pl.Path(file).suffix == '.csv':
        fullPath = os.path.join(sourceLoc, file)
        print(file)
        initDF = pd.read_csv(fullPath)
        cleanDF = initDF.dropna(subset=['Name'])  # drop the 'empty' NaN rows
        masterDF = masterDF.append(cleanDF)  # use pd.concat in pandas >= 2.0
masterDF.to_csv(destLoc, index=False)
My only issue is that the input dates are formatted like this: 25/05/21, but the output dates end up formatted like this: 05/25/21. As I'm in the UK and using a UK version of Excel to analyse the output, this confuses all my functions.
The only solutions I've found so far are to reformat the date columns individually or to style them, which to my understanding only affects how they look in Jupyter and not the actual data. As there are multiple date columns in the source data files, I'd rather not have to reformat them all individually.
Is there any way of defining the date format when first creating the dataframe, or reformatting every date column once the dataframe is filled?
In the end this issue was caused by two different problems.
The first was Excel intermittently exporting my dates in US format despite the original format (and my Windows Region settings) being UK format. I've now added a short VBA loop in my export code to ensure those columns are formatted correctly every time the data is exported.
The second was the CSV date being imported with incorrect dtypes. I suspect this was again the fault of Excel (2010 is problematic) but I'm unsure. I'm now correcting this with an astype() method.
The end result is my dates are now imported into Pandas in the correct format and outputted to a new CSV in the correct format too.
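For reference, pandas can also handle both directions directly; a sketch, with the column name assumed for illustration:

import pandas as pd

# parse UK-style day-first dates on the way in
df = pd.read_csv("input.csv", parse_dates=["Date"], dayfirst=True)

# write any datetime columns back out in UK format
df.to_csv("output.csv", index=False, date_format="%d/%m/%y")

read_csv's dayfirst flag covers the import side, and to_csv's date_format controls how datetime columns are rendered in the output, so no per-column reformatting is needed.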
I am trying to read a CSV file with some garbage at the top, but also garbage at the bottom of the interesting data. I need to read multiple files, and the length of the interesting data varies between them. Is there a way to let the pd.read_csv command know that the dataframe ends at the first line break?
Example data (screenshot from Excel):
I read the file with:
dataframe = pd.read_csv(file, skiprows=45)
This nicely gives me a dataframe with 10 columns, with the headers on line 46 (see image). However, it continues past the #GARBAGE DATA row.
Important note: Neither the length of the data nor the length of the footer is of equal length in the different files I want to read.
Two ways you could implement this:
1) Use the skipfooter parameter of read_csv; it tells the function the number of lines at the bottom of the file to skip. Note that skipfooter is only supported by the Python parser engine:
pd.read_csv("in.csv", skiprows=45, skipfooter=2, engine="python")
2) Read the file as it is and later use the dropna function; this should drop the garbage rows.
df.dropna(inplace=True)
After using this command:
dataframe = pd.read_csv(file, skiprows=45)
You can use this command:
dataframe = dataframe.dropna(how='any')
This deletes a row if any empty value is found in it. Since the garbage rows at the bottom do not fill every column, they are read in with NaNs and are therefore dropped along with the rest of the junk.
I'm using a simple piece of code to import an Excel file. However, the command is combining the first two rows into one. I would like to keep them separated (as they are in the Excel file).
db=pd.read_excel('fileaddress', sheetname='Sheet1')
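Two things may be worth checking here (the cause is a guess): sheetname was renamed sheet_name in later pandas versions, and by default read_excel promotes the first row to column labels. If the first two rows are both data, header=None keeps them as separate data rows:

import pandas as pd

# header=None stops pandas from treating the first row as the header
db = pd.read_excel('fileaddress', sheet_name='Sheet1', header=None)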
I have a pandas data frame with two columns:
year experience and salary
I want to save a csv file with these two columns and also have some stats at the head of the file as in the image:
Is there any option to handle this with pandas or any other library, or do I have to write a script that builds the file line by line, adding the commas between fields?
Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The CSV RFC (RFC 4180) states that "each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths to go from here:
i. You create two separate data frames and map them to CSV files (super precise would be three), one with your records and one with the additional values.
ii. Write your data frame to CSV first, then open that file and insert your additional values at the top (see the sketch below).
iii. If your goal is an import into Excel, however, #gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
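A minimal sketch of option ii, with hypothetical column names year_experience and salary:

import pandas as pd

df = pd.DataFrame({"year_experience": [1, 3, 5],
                   "salary": [40000, 55000, 70000]})

with open("salaries.csv", "w", newline="") as f:
    # the stats lines form a preamble, not CSV records proper
    f.write(f"average,{df['salary'].mean()}\n")
    f.write(f"max,{df['salary'].max()}\n")
    df.to_csv(f, index=False)

Anything reading the file back then has to skip the preamble rows, which is exactly what the next answer does.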
You can read the file as two separate parts (stats and csv)
Reading stats:
import pandas

number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')
Take a look at xlsxwriter. Perhaps it's what you are looking for.