Pandas: use column names if they do not exist - python

Is there a way, without reading the file twice, to check whether a column exists and otherwise use the column names passed in? I have files with the same structure, but some of them do not contain a header for some reason.
Example with header:
Field1 Field2 Field3
data1 data2 data3
Example without header:
data1 data2 data3
When using the call below, if the file has a header, pandas turns that header into the first row of data instead of replacing it.
pd.read_csv('filename.csv', names=col_names)
When using the call below, it will drop the first row of data if there is no header in the file.
pd.read_csv('filename.csv', header=0, names=col_names)
My current workaround is to load the file, check whether the columns exist, and if they don't, read the file again.
df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)
Is there a better way to handle this data set that doesn't involve potentially reading the file twice?

Just modify your logic so the first time through only reads the first row:
# Load the first row and set up keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names

# Load the data
df = pd.read_csv('filename.csv', **kw_args)
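An alternative sketch, not from the answers here: the standard library's csv.Sniffer can guess whether a header is present, so the check works even when you don't know which column name to look for. Its heuristic can misfire on all-string data, so treat it as best-effort:

import csv
import pandas as pd

col_names = ['Field1', 'Field2', 'Field3']  # assumed fallback names

with open('filename.csv', newline='') as f:
    has_header = csv.Sniffer().has_header(f.read(2048))  # sniff a small sample
    f.seek(0)  # rewind so read_csv sees the whole file
    df = pd.read_csv(f, names=None if has_header else col_names)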

You can do this with the seek method of the file object:
with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return the file pointer to the beginning of the file
    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...
    df = pd.read_csv(csvfile, ...)
The file is read only once.
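For completeness, a minimal sketch of the elided branches, assuming col_names holds the fallback header list:

import pandas as pd

col_names = ['Field1', 'Field2', 'Field3']  # assumed fallback names

with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # rewind before the real read
    if 'Field1' in headers:
        df = pd.read_csv(csvfile)                   # file provides its own header
    else:
        df = pd.read_csv(csvfile, names=col_names)  # supply the names ourselves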

Related

appending a df to a new row of a csv file adds an empty row between the header and the added data

I want to append this single-row df
rndList = ["albert", "magnus", "calc", 2, 5, "drop"]
rndListDf = pd.DataFrame([rndList])
to a new row of this csv file,
first,second,third,fourth,fifth,sixth
to place each value under the corresponding column header.
Using this approach:
rndListDf.to_csv('./rnd_data.csv', mode='a', header=False)
leaves an empty row between the header and the data in the csv file.
How can I append the row without the empty row? Desired result:
first,second,third,fourth,fifth,sixth
0,albert,magnus,calc,2,5,drop
I think you have empty lines after your header row, but you can try:
data = pd.read_csv('./rnd_data.csv')
rndListDf.rename(columns=dict(zip(rndListDf.columns, data.columns))) \
.to_csv('./rnd_data.csv', index=False)
Content of your file after this operation:
first,second,third,fourth,fifth,sixth
albert,magnus,calc,2,5,drop
I tested it. Neither your code nor pandas.to_csv appends a new line; it comes from your original csv file. If you are trying to figure out how to add a header to your dataframe:
rndList = ["albert", "magnus", "calc", 2, 5, "drop"]
rndListDf = pd.DataFrame([rndList])
rndListDf.columns = 'first,second,third,fourth,fifth,sixth'.split(',')
rndListDf.to_csv('./rnd_data.csv', index=False)
Alternatively, you can first clean your csv as suggested by Corralien and continue doing what you are doing. However, I would suggest going with Corralien's solution.
# Cleanup
pd.read_csv('./rnd_data.csv').to_csv('rnd_data.csv', index=False)
# Your Code
rndList = ["albert", "magnus", "calc", 2, 5, "drop"]
rndListDf = pd.DataFrame([rndList])
rndListDf.to_csv('./rnd_data.csv', mode='a', header=False, index=False)
# Result
first,second,third,fourth,fifth,sixth
albert,magnus,calc,2,5,drop

Handle variable as file with pandas dataframe

I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list without first saving the list to a csv and reading the csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tabs):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    [f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
and if you want the first row to become the header, then:
df.columns = df.iloc[0]
df = df[1:]
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or socket/network as a single string) and then use io.StringIO or io.BytesIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or, shorter:
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from the network (a socket or web API) as a single string.
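The same trick works for raw bytes (for example an HTTP response body) with io.BytesIO; a sketch assuming UTF-8 encoded, tab-separated data:

import io
import pandas as pd

raw = b"Col1\tCol2\tCol3\n12\tInfo1\t34.1\n15\tInfo4\t674.1\n"  # e.g. response.content
df = pd.read_csv(io.BytesIO(raw), sep='\t')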

Reading Columns without headers

I have some code that reads all the CSV files in a certain folder and concatenates them into one excel file. This code works as long as the CSVs have headers, but I'm wondering if there is a way to alter my code in case my CSVs don't have any headers.
Here is what works:
path = r'C:\Users\Desktop\workspace\folder'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df = df[~df['Ran'].isin(['Active'])]
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.drop_duplicates(subset=None, inplace=True)
What this does is delete any row in my CSVs with the word "Active" under the "Ran" column. But if I didn't have a "Ran" header for this column, is there another way to read the column and do the same thing?
Thanks in advance!
df = df[~df['Ran'].isin(['Active'])]
Instead of selecting the column by name, select it by index. If the 'Ran' column is the third column in the csv, use...
df = df[~df.iloc[:, 2].isin(['Active'])]
(For headerless files, also pass header=None to read_csv so the first data row isn't consumed as a header.)
If some of your files have headers and some don't, then you should probably look at the first line of each file before you make a DataFrame from it.
for filename in all_files:
    with open(filename) as f:
        first = next(f).rstrip('\n').split(',')  # strip the newline before comparing
    if first == ['my', 'list', 'of', 'headers']:
        header = 0
        names = None
    else:
        header = None
        names = ['my', 'list', 'of', 'headers']
    # no seek needed: read_csv reopens the file by name
    df = pd.read_csv(filename, index_col=None, header=header, names=names)
    df = df[~df['Ran'].isin(['Active'])]
If I understood your question correctly ...
If the header is missing, yet you know the data format, you can pass the desired column labels as a list, such as ['id', 'thing1', 'ran', 'other_stuff'], to the names parameter of read_csv.
Per the pandas docs:
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
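A minimal sketch of that combination, with hypothetical labels and file name; header=0 consumes the file's own header row and names= replaces it:

import pandas as pd

col_names = ['id', 'thing1', 'ran', 'other_stuff']  # hypothetical labels
df = pd.read_csv('file_with_header.csv', header=0, names=col_names)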

Pandas Processing Large CSV Data

I am processing a large data set, at least 8 GB in size, using pandas.
I encountered a problem reading the whole file at once, so I read it chunk by chunk.
In my understanding, chunking the whole file creates many separate dataframes, so my existing routine only removes duplicate values within each chunk's dataframe, not across the whole file.
I need to remove the duplicates across the whole data set based on the ['Unique Keys'] column.
I tried to use pd.concat, but I also ran into memory problems, so instead I tried writing each chunk's results to a csv file and appending them there.
After running the code, the file doesn't shrink much, so I think my assumption is right: the current routine is not removing duplicates across the whole data set.
I'm a newbie in Python so it would really help if someone can point me in the right direction.
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')
If you can fit the set of unique keys in memory:
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()
    for i, df in enumerate(df_iterator):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Filter out rows whose key was already seen in an earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        # Add the new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))
        # Append the surviving rows (output name is illustrative);
        # the header is written only for the first chunk
        df.to_csv(filename + '.dedup.csv', mode='a',
                  header=(i == 0), index=False)
Probably easier not to do it with pandas at all:
import csv

# key_indices: positions of the key column(s) in each row (assumed known)
with open(input_csv_file) as fin, \
     open(output_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    seen_keys = set()
    header = True
    for row in csv.reader(fin):
        if header:
            writer.writerow(row)
            header = False
            continue
        key = tuple(row[i] for i in key_indices)
        if not all(key):  # skip if the key is empty
            continue
        if key not in seen_keys:
            writer.writerow(row)
            seen_keys.add(key)
I think this is a clear example of when you should use Dask or PySpark. Both allow you to read files that do not fit in memory.
As an example with Dask you could do:
import dask.dataframe as dd
df = dd.read_csv(filename, na_filter=False)
df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something wrong seems to be happening at line 8, but I can hardly find an 8th line since there's clearly only one row in it.
I do the following:
with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the csv in Notepad and noticed that the separator was a tab, not a comma, and then tried the combination below.
df = pd.read_csv('C:\\myfile.csv',sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False)
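Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on current versions the equivalent is:

df = pd.read_csv(file, header=None, on_bad_lines="skip")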
The existing answer will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest row, you can use the following:
delimiter = ','
with open(path_name) as f:
    max_columns = max(line.count(delimiter) for line in f) + 1  # n delimiters = n+1 fields
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 only if there's actually a header; you can always retrieve the header's column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
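A hedged sketch of that check, reusing path_name and the widest-row frame df built above: count the populated cells per row and compare against the width of the original header row:

header_width = len(pd.read_csv(path_name, nrows=0).columns)  # columns in the original header
too_wide = df[df.notna().sum(axis=1) > header_width]         # rows wider than the header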
