I created a dictionary with several dataframes using the following code:
from pathlib import Path

import pandas as pd

dataframes = {}
csv_root = Path(".")
for csv_path in csv_root.glob("*.csv"):
    key = csv_path.stem  # the filename without the ".csv" extension
    dataframes[key] = pd.read_csv(csv_path, skiprows=1, delim_whitespace=True)
However, it does not recognize all the columns in each dataframe, which are separated by a comma "," in csv format. Instead of 7 columns, it only finds 2.
Can someone help me to fix this?
Thanks in advance!
From the documentation:

delim_whitespace : bool, default False
    Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

By setting this option, you are telling pandas to split your csv file on whitespace rather than on commas (the default behavior).
Try with
pd.read_csv(csv_path, skiprows=1)
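Applied to the loop from the question, a minimal corrected sketch (assuming the files really are comma-separated and the first row still needs to be skipped):

from pathlib import Path

import pandas as pd

dataframes = {}
for csv_path in Path(".").glob("*.csv"):
    # With the default sep="," pandas now finds all 7 columns
    dataframes[csv_path.stem] = pd.read_csv(csv_path, skiprows=1)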
I am trying to extract the meta data for some experiments I'm helping conduct at school. We are naming our data files something like this:
name_date_sample_environment_run#.csv
What I need to do is write a function that separates each piece to a list that'll be output like this:
['name', 'date', 'sample', 'environment', 'run#']
Though I haven't quite figured it out. I think I need to load the file, convert the name to a string, and then split on each underscore to build the list. I don't know how to load the file so that I can convert its name to a string. Any help will be appreciated!
P.S - I will eventually need to figure out a way to save this data into a spreadsheet so we can see how many experiments we do with certain conditions, who performed them, etc. but I can figure that out later. Thanks!
If you're just asking how to break down the string into all the components separated by an underscore, then the easiest way would be using the split function.
x = 'name_date_sample_environment_run#.csv'
y = x.split('_')
# y = ['name', 'date', 'sample', 'environment', 'run#.csv']
The split function simply breaks down the string every time it sees the underscore. If you want to remove the .csv part from 'run#.csv' then you can process the original string to remove the last 4 characters.
x = 'name_date_sample_environment_run#.csv'
x = x[:-4]
y = x.split('_')
# y = ['name', 'date', 'sample', 'environment', 'run#']
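If you'd rather not hard-code the 4-character extension, a sketch using pathlib (assuming the underscore convention from the question) strips whatever suffix the file has:

from pathlib import Path

x = Path('name_date_sample_environment_run#.csv')
y = x.stem.split('_')  # .stem drops the extension regardless of its length
# y = ['name', 'date', 'sample', 'environment', 'run#']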
If all your files follow that structure and are in the same folder, you can do it this way:
import os

files = os.listdir('.')  # insert folder path
structured_files = []
for file in files:
    name, date, sample, environment, run = file.split('_')
    structured_files.append({'name': name, 'date': date, 'sample': sample,
                             'environment': environment, 'run': run[:-4]})
Then you'll have a list of dicts with your file info.
If you want to, you can load it into pandas and save it to an Excel sheet (to_excel needs an Excel writer such as openpyxl installed):
import os

import pandas as pd

files = os.listdir('.')  # insert folder path
structured_files = []
for file in files:
    name, date, sample, environment, run = file.split('_')
    structured_files.append({'name': name, 'date': date, 'sample': sample,
                             'environment': environment, 'run': run[:-4]})

pd.DataFrame(structured_files).to_excel('files.xlsx', index=False)
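One caveat: os.listdir returns every entry in the folder, not just your data files. A sketch using pathlib's glob instead (assuming all data files end in .csv and follow the five-field underscore convention) filters for you:

from pathlib import Path

import pandas as pd

rows = []
for path in Path('.').glob('*_*_*_*_*.csv'):  # five underscore-separated fields
    name, date, sample, environment, run = path.stem.split('_')  # .stem already drops ".csv"
    rows.append({'name': name, 'date': date, 'sample': sample,
                 'environment': environment, 'run': run})

pd.DataFrame(rows).to_excel('files.xlsx', index=False)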
I am copying list output data from a DataCamp course so I can recreate the exercise in Visual Studio Code or Jupyter Notebook. From the DataCamp Python interactive window, I type the name of the list, highlight the output, and paste it into a new file in VSCode. I use find-and-replace to delete all the commas and spaces, leaving 142 numeric values, and save the file as life_exp.csv. It looks like this:
43.828
76.423
72.301
42.731
75.32
81.235
79.829
75.635
64.062
79.441
When I read the file back in, using either pandas read_csv with values.tolist() or csv.reader with a for loop appending to an empty list, both approaches give me a list of lists, which then does not plot correctly when I try to create matplotlib histograms.
I also saved the data with Notepad as a .csv, and both ways of saving produce the same issue.
import matplotlib.pyplot as plt
import csv

life_exp = []
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        life_exp.append(row)
And
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
life_exp = life_exp_df.values.tolist()
When you print life_exp after importing using csv, you get:
[['43.828'],
['76.423'],
['72.301'],
['42.731'],
['75.32'],
['81.235'],
['79.829'],
['75.635'],
['64.062'],
['79.441'],
['56.728'],
….
And when you print life_exp after importing using pandas read_csv, you get the same thing, but at least now the values are floats instead of strings:
[[43.828],
[76.423],
[72.301],
[42.731],
[75.32],
[81.235],
[79.829],
[75.635],
[64.062],
[79.441],
[56.728],
…
and when you call plt.hist(life_exp) on either version of the list, each value ends up as its own bin of 1.
I just want to read each value in the csv file and put each value into a simple Python list.
I have spent days scouring Stack Overflow thinking someone has done this, but I can't seem to find an answer. I am very new to Python, so your help is greatly appreciated.
Try:
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
# Select the values of your first column as a list
life_exp = life_exp_df.iloc[:, 0].tolist()
instead of:
life_exp = life_exp_df.values.tolist()
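With a flat list of floats, plt.hist should now bin the data as expected; a minimal sketch (the bin count of 10 is an arbitrary choice):

import matplotlib.pyplot as plt

plt.hist(life_exp, bins=10)  # one flat dataset, ten bins
plt.xlabel('Life expectancy')
plt.ylabel('Count')
plt.show()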
With csv.reader, each line is parsed into a list using the delimiter you provide. In this case you provide \n as the delimiter, so the single item on each line is still returned as a one-element list.
When you append each row, you are essentially appending that one-element list to another list. The simplest workaround is to index into row to extract that value:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        life_exp.append(row[0])
However, if your data is not guaranteed to be formatted the way you have provided, you will need to handle that a bit differently:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        for number in row:
            life_exp.append(number)
A bit cleaner with a list comprehension:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    life_exp = [number for row in exp_read for number in row]
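Note that the csv module hands back strings either way, so the values still need converting before plotting. A minimal sketch that skips csv entirely and converts as it reads (assuming one number per line, as in the sample):

with open(r'C:\data\life_exp.csv') as f:
    life_exp = [float(line) for line in f if line.strip()]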
I have a dataset (for the compbio people out there, it's a FASTA) that is littered with newlines that don't act as a delimiter of the data.
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
sample data:
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
every entry is delimited by the ">"
the data itself is wrapped across newlines (nominally limited to 80 characters per line, though that limit is not respected everywhere)
You need another character that tells pandas when you actually want to start a new row.
Here, for example, I create a file where the row break is encoded by a pipe (|):
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
f.writelines(csv)
Then you read it with the C engine and specify the pipe as the lineterminator:
import pandas as pd

pd.read_csv("test.csv", lineterminator="|", engine="c")
which gives me a DataFrame whose rows break at the pipes, with the embedded newlines kept inside the field values.
This should work simply by setting skip_blank_lines=True.
skip_blank_lines : bool, default True
    If True, skip over blank lines rather than interpreting as NaN values.
However, I found that I had to set this to False to work with my data that has new lines in it. Very strange, unless I'm misunderstanding.
Docs
There is no good way to do this with pandas alone.
BioPython alone turned out to be sufficient, and simpler than a hybrid solution that iterates through a BioPython object and inserts the records into a dataframe.
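For reference, a minimal sketch of that BioPython route (assuming Biopython is installed; example.fasta is a hypothetical input file):

import pandas as pd
from Bio import SeqIO  # pip install biopython

# SeqIO.parse re-joins the wrapped sequence lines for us
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse("example.fasta", "fasta")]
df = pd.DataFrame(records, columns=["id", "sequence"])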
Is there a way for pandas to ignore newlines when importing, using any of the pandas read functions?
Yes, just look at the doc for pd.read_table()
You want to specify a custom line terminator (>) and then handle the newline (\n) appropriately: use the first as a column delimiter with str.split(maxsplit=1), and ignore subsequent newlines with str.replace (until the next terminator):
#---- EXAMPLE DATA ---
from io import StringIO

example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------

#---- EXAMPLE CODE ---
import pandas as pd

df = pd.read_table(
    example_file,        # Your file goes here
    engine='c',          # C parser must be used to allow a custom lineterminator, see doc
    lineterminator='>',  # New records begin with ">"
    skiprows=1,          # File begins with line terminator ">", so output skips first line
    names=['raw'],       # A single column which we will split into two
    comment=';',         # Comment character in FASTA format
)

# The first line break ('\n') separates Column 0 from Column 1
df[['col0', 'col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))

# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n', ''))

print(df[['col0', 'col1']])

# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
col0 col1
0 ERR899297.10000174 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1 ERR123456.12345678 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
After pd.read_csv(), you can split the single parsed column with .str.split():
import pandas as pd

data = pd.read_csv("test.csv", header=None)
data = data[0].str.split(";", expand=True)
I am importing a .csv file in Python with pandas.
Here is the file format from the .csv:
a1;b1;c1;d1;e1;...
a2;b2;c2;d2;e2;...
.....
here is how I get it:
from pandas import *
csv_path = "C:...."
data = read_csv(csv_path)
Now when I print the file I get this:
0 a1;b1;c1;d1;e1;...
1 a2;b2;c2;d2;e2;...
And so on... So I need help to read the file and split the values into columns, using the semicolon character ;.
read_csv takes a sep param, in your case just pass sep=';' like so:
data = read_csv(csv_path, sep=';')
The reason it failed in your case is that the default value is ',' so it scrunched up all the columns as a single column entry.
In response to Morris' question above:
"Is there a way to programatically tell if a CSV is separated by , or ; ?"
This will tell you:
import pandas as pd

df_comma = pd.read_csv(your_csv_file_path, nrows=1, sep=",")
df_semi = pd.read_csv(your_csv_file_path, nrows=1, sep=";")
if df_comma.shape[1] > df_semi.shape[1]:
    print("comma delimited")
else:
    print("semicolon delimited")