Read one file and rewrite to another, dropping the first column - python

Reading a ;-separated csv file and rewriting it to another csv separated by ",", the result ends up with an additional first column that counts the rows from 0 to n. How do I leave that new column out?
I import pandas, read the file with delimiter=";" (because the file to be read is already split into columns), and then call df.to_csv to rewrite it to a new csv file separated by commas, i.e. each row is one comma-separated line of data.
import pandas as pd
df = pd.read_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\old_file.csv", delimiter=";", encoding='cp1252', error_bad_lines=False)
print(df.columns)
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new_file.csv", sep=",")
The code runs fine, but the result in new_file.csv looks as follows:
,data,data,data,....
0,data,data,data,....
1,data,data,data,....
2,data,data,data,....
3,data,data,data,....
4,data,data,data,....
…
11,625,data,data,data,....
So I need to know how to rewrite the code so that the leftmost column, counting 0, 1, 2, ..., 11,625, is left out. How do I do that?
Thanks in advance.

Tell the exporter not to export the index:
df.to_csv("...", index=False)
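Applied to the code in the question, a minimal sketch would be (paths and encoding copied from the question; the error_bad_lines argument is left out here, since newer pandas versions replace it with on_bad_lines):
import pandas as pd
df = pd.read_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\old_file.csv", delimiter=";", encoding="cp1252")  # read the ;-separated source file
df.to_csv("C:\\Users\\jcst\\Desktop\\Private\\Python data\\new_file.csv", sep=",", index=False)  # index=False keeps the 0..n counter out of the output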

Related

Drop rows from Dask DataFrame where column count is not equal

I have a CSV file which I want to normalize for SQL input. I want to drop every line where the column count is not equal to a certain number, so that I can ignore the bad lines where a column shift can happen. In the past I used AWK to normalize this CSV dataset, but I want to implement this program in Python for easier parallelization than the GNU Parallel + AWK solution.
I tried the following code to drop the lines:
df.drop(df[df.count(axis='columns') != len(usecols)].index, inplace=True)
df = df[df.count(axis=1) == len(usecols)]
df = df[len(df.index) == len(usecols)]
None of these work. I need some help, thank you!
EDIT:
I'm working on a single CSV file on a single worker.
EDIT 2:
Here is the awk script for reference:
{
line = $0;
# ...
if (line ~ /^$/) next; # if line is blank, then remove it
if (NF != 13) next; # if column count is not equal to 13, then remove it
}
The question is not easy to understand. From the first statement it appears as if you are working with a single file; is that correct?
If so, and if there are unnamed columns, pandas (or dask via pandas) will try to 'fix' the structure by adding missing column labels such as 'Unnamed: 0'. Once that happens, it's easy to drop the misaligned rows using something like:
mask = df['Unnamed: 0'].isna()
df = df[mask]
Edit: if there are rows that contain more entries than the number of defined columns, pandas will raise an error, saying it was not able to parse the csv.
If, however, you are working with multiple csv files, then one option is to use dask.delayed to enforce compatible columns; see this answer for further guidance.
It's easier to post a separate answer, but it seems that this problem can be solved by passing the on_bad_lines kwarg to pandas.read_csv (note: on pandas versions lower than 1.3.0 you will need the older error_bad_lines/warn_bad_lines arguments instead). Roughly, the code would look like this:
from pandas import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
Since dask.dataframe can pass kwargs to pandas, the above can also be written for dask.dataframe:
from dask.dataframe import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn') # can use skip
With this, the imported csv will not contain any lines that have more columns than expected based on the header (a line with fewer elements than the number of columns will still be included, with the missing values set to NaN).
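If those short lines should be dropped as well, one possible follow-up (a sketch, assuming a valid row never has missing values) is to discard rows containing NaN after the import:
from pandas import read_csv
df = read_csv('some_file.csv', on_bad_lines='warn')
df = df.dropna()  # rows shorter than the header come back padded with NaN; drop them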
I ended up creating a function which pre-processes the zipped CSV file for Pandas/Dask. These are not CPU/memory-heavy tasks and parallelization is not important in this step, so until there's a better way to do this, here we are. I'm adding a proper header for my pre-processed CSV file, too.
from io import TextIOWrapper
from zipfile import ZipFile

with open(csv_filename, 'wt', encoding='utf-8', newline='\n') as file:
    file.write('|'.join(usecols) + "\n")  # Add header
    with ZipFile(destination) as z:
        with TextIOWrapper(z.open(f"{filename}.csv"), encoding='utf-8') as f:
            for line in f:
                line = line.strip()  # Remove surrounding whitespace
                if not line:  # Exclude empty lines
                    continue
                array = line.split("|")
                if len(array) == column_count:
                    del array[1:3]  # Remove the elements at index 1 and 2
                    array = [s.strip() for s in array]  # Strip whitespace from each field
                    file.write('|'.join(array) + "\n")
# The with-statement closes the file, so no explicit file.close() is needed.
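For completeness, the normalized, pipe-separated file written above can then be loaded lazily, e.g. (a sketch reusing csv_filename and the '|' separator from the snippet above):
import dask.dataframe as dd
ddf = dd.read_csv(csv_filename, sep='|')  # dask reads the pre-processed file in partitions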
PS: This is not an answer to my original question, which is why I won't accept it.

How to read specific rows and columns, which satisfy some condition, from file while initializing a dataframe in Pandas?

I have been looking for an approach that will let me load only those columns from a csv file which satisfy a certain condition while I create the DataFrame, something which can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns which have mean > 0.0. The idea is similar to skipping a certain number of rows or reading only the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible in Pandas? Can it do this on the fly, accumulating results without loading everything into memory?
There's no direct/easy way of doing that (that I know of)!
The first idea that comes to mind is to read just the first line of the csv (i.e. the headers) and then build a list of your desired columns with a list comprehension:
columnsOfInterest = [c for c in df.columns.tolist() if 'node' in c]
and get their positions in the csv. With that, you'll have the columns/positions, so you can read only those from your csv.
However, for the second part of your condition, the mean calculation, unfortunately you'll have to read all the data for these columns, run the mean calculations and then keep the ones of interest (where the mean is > 0). But that's the limit of my knowledge; maybe someone else has a way of doing this and can help you out, good luck!
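A rough sketch of that two-step idea (the file name 'data.csv' is a placeholder, 'node' is the example substring from the comprehension above, and the selected columns are assumed to be numeric):
import pandas as pd
header = pd.read_csv('data.csv', nrows=0)  # read only the header row
columnsOfInterest = [c for c in header.columns if 'node' in c]
df = pd.read_csv('data.csv', usecols=columnsOfInterest)  # load just those columns
means = df.mean(numeric_only=True)
df = df[means[means > 0].index]  # keep only the columns whose mean is > 0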
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
You could preprocess the column headers using the csv library first.
import csv
f = open('data.csv', 'r', newline='')  # text mode; newline='' is recommended for the csv module
reader = csv.reader(f)
column_names = next(reader, None)
filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)

How to tell pandas to read columns from the left?

I have a csv in which one header column is missing, e.g. I have n data columns but only n-1 header names. When this happens, pandas shifts my first column to be the index. So the column to the right of date_time in the csv ends up under the date_time column in the pandas data frame.
My question is: how can I force pandas to read from the left so that the date_time data remains under the date_time column instead of becoming the index? If pandas could simply read from left to right and add dummy column names at the end, that would be great.
Side note: I concede that my input csv should be "clean"; however, I think that pandas/frameworks in general should be able to handle the case in which some data are unclean but the user wants to proceed with the analysis instead of spending 30 minutes writing a side function/script to fix these minor issues. In my case, the data I care about is usually in the first 15 columns and I don't really care if the columns after that are misaligned. However, when I read the dataframe into pandas, I'm forced to care and waste time fixing these issues even though I don't care about the remaining columns.
Since you don't care about the last column, just set index_col=False
df = pd.read_csv(file, index_col=False)
That way, it will sequentially match the columns with data for the first n-1 columns; data after that will not be in the data frame.
You may also skip the first row to have all your data in the data frame first:
df = pd.read_csv(file, skiprows=1, header=None)
and then just set the column names afterwards:
df.columns = ['col1', 'col2', ....] + ['dummy_col1', 'dummy_col2'...]
where the first list comes from row 0 of your csv, and the second list you fill dynamically with a list comprehension.
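As a sketch of that second approach (the file path and the n-1 real names are placeholders; header=None stops pandas from consuming the first data row as a header):
import pandas as pd
df = pd.read_csv('file.csv', skiprows=1, header=None)
named_cols = ['date_time', 'col2']  # placeholder: the n-1 names from row 0 of the csv
dummy_cols = [f'dummy_col{i}' for i in range(df.shape[1] - len(named_cols))]  # filled dynamically
df.columns = named_cols + dummy_cols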

There is an extra id column in dataFrame read from csv [duplicate]

I am trying to save a csv to a folder after making some edits to the file.
Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.
I tried:
pd.read_csv('C:/Path to file to edit.csv', index_col = False)
And to save the file...
pd.to_csv('C:/Path to save edited file.csv', index_col = False)
However, I still got the unwanted index column. How can I avoid this when I save my files?
Use index=False.
df.to_csv('your.csv', index=False)
There are two ways to handle the situation where we do not want the index to be stored in the csv file.
As others have stated, you can use index=False while saving your dataframe to the csv file:
df.to_csv('file_name.csv', index=False)
Or you can save your dataframe as it is, with an index, and drop the 'Unnamed: 0' column containing your previous index while reading it back. Simple!
df.to_csv('file_name.csv')
df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)
If you want no index, read the file using:
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
and save it using:
df.to_csv('file.csv', index=False)
As others have stated, if you don't want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)
However, since the data you will usually use have some sort of index of their own, let's say a 'timestamp' column, I would keep the index and load the data using it.
So, to save the indexed data, first set their index and then save the DataFrame:
df = df.set_index('timestamp')
df.to_csv('processed.csv')
Afterwards, you can either read the data with the index:
pd.read_csv('processed.csv', index_col='timestamp')
or read the data, and then set the index:
df = pd.read_csv('filename.csv')
df = df.set_index('column_name')
Another solution, if you want to keep this column as the index:
pd.read_csv('filename.csv', index_col='Unnamed: 0')
If you want a clean output format, the following statement works well:
dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)
In this case you get a csv file with ',' as the separator between columns and utf-8 encoding.
In addition, the numerical index won't appear.

