Handle variable as file with pandas dataframe - python

I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare delimiter which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error
ValueError: Invalid file path or buffer object type: <class 'list'>
If there a way to use pd.read_csv() with my list and not first save the list to a csv and read the csv file in a second step?
I also tried pd.read_table() which also need a file or buffer object.
Example data (seperated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
[f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1 )

import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
and you want header as your first row then
df.columns = df.iloc[0]
df = df[1:]

It seems simpler to convert it to nested list like in other answer
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to single string (or get it directly form file on socket/network as single string) and then you can use io.BytesIO or io.StringIO to simulate file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from network (socket, web API) as single string or data.

Related

pandas read csv is confused when commas within quotes

col1, col2, geometry
11.54000000,0.00000000,"{"type":"Polygon","coordinates":[[[-61.3115751786311,-33.83968838375797],[-61.29737019968823,-33.83207774370677],[-61.29443049860791,-33.83592770721248],[-61.29241347742871,-33.83489393774538],[-61.28994584513501,-33.83806650089736],[-61.292499308117186,-33.83938539699006],[-61.28958106470898,-33.8431993873636],[-61.29307859612687,-33.84495487100211],[-61.295256567865046,-33.846135537383866],[-61.296388484054326,-33.84676149889543],[-61.296747927196776,-33.84651421268175],[-61.297498943449426,-33.84670133707654],[-61.297992472179686,-33.847120134589964],[-61.299741220055196,-33.84901812154847],[-61.3012164422457,-33.85018089588664],[-61.3015892874819,-33.850566250375365],[-61.30284190607861,-33.85079121660985],[-61.30496105223345,-33.848193766906206],[-61.306084952130036,-33.84682375029292],[-61.30707604410075,-33.845532812572294],[-61.30672627175046,-33.84527169005647],[-61.306290670206494,-33.845188781884744],[-61.304604048903514,-33.847304098561025],[-61.30309763921784,-33.84654473836309],[-61.30013213880613,-33.84478736144466],[-61.30110629620797,-33.8431690707163],[-61.303046037678854,-33.844170576767105],[-61.30433047221653,-33.84266156764314],[-61.30484242472771,-33.842899106713375],[-61.30696068650711,-33.844104878773436],[-61.306418212892446,-33.84505221083753],[-61.307163201216696,-33.845464893960255],[-61.30760172622554,-33.84490909256552],[-61.307932962646014,-33.844513681420494],[-61.309176116985405,-33.84280834206188],[-61.30596211112515,-33.841126948963954],[-61.3056475423994,-33.841449215098756],[-61.30526859890979,-33.841557611902374],[-61.30483601097522,-33.84149669494795],[-61.30448925534122,-33.84120408616046],[-61.30410688411086,-33.840609953572034],[-61.30400151682434,-33.839925243738094],[-61.30240379835875,-33.83889223688216],[-61.30188418287129,-33.838444480832685],[-61.301130848179525,-33.83943255499186],[-61.30078636095504,-33.83996223583909],[-61.30059265818967,-33.84016469670277],[-61.30048478527255,-33.840438447848506],[-61.300252198180424,-33.84026774340676],[-61.29876711207748,-33.839489883020924],[-61.29799408649143,-33.840597902688785],[-61.297669258508,-33.84103160870988],[-61.297566592962134,-33.84112444052047],[-61.29748538503245,-33.841083604060834],[-61.297140578061956,-33.84134946797752],[-61.29709617977233,-33.84160419097128],[-61.297170540239335,-33.84168254110631],[-61.297341460506956,-33.84179653572337],[-61.297243418161194,-33.84197105818567],[-61.29699517169225,-33.84200300239938],[-61.29680176950715,-33.84179064473802],[-61.29691703393983,-33.8416707218475],[-61.297053755769845,-33.841604265738546],[-61.29707920124143,-33.84154875978832],[-61.29709391784669,-33.84147543150246],[-61.29711262215961,-33.84133768608576],[-61.296951411710374,-33.84119216012805],[-61.297262269660294,-33.84089514360839],[-61.297626491077864,-33.84051497848962],[-61.29865532547658,-33.83935363544152],[-61.30027710358755,-33.84011486145675],[-61.30046658230606,-33.83996490243917],[-61.30063460268783,-33.83979712050095],[-61.300992098665965,-33.8393813535522],[-61.301799802937595,-33.83832425565103],[-61.30135527704997,-33.837671541923235],[-61.30082030025984,-33.83731962483044],[-61.299512855628244,-33.83689640801839],[-61.29879550338594,-33.8363083288346],[-61.29831419490918,-33.835559835856905],[-61.298360098160686,-33.83408067231082],[-61.29976541168753,-33.83467181800819],[-61.30104200723692,-33.83586895614681],[-61.30133434017162,-33.83606352507277],[-61.30153415160492,-33.836339043812224],[-61.30164813329583,-33.83657891551336],[-61.30124575062752,-33.83743146168004],[-61.30195917352424,-33.83831965157767],[-61.30196183786503,-33.83843401993221],[-61.30250094586367,-33.83890484694379],[-61.304002690127376,-33.83984352469762],[-61.30473149692381,-33.8397514189025],[-61.3054487998093,-33.839941491549894],[-61.30582354557356,-33.84016574092716],[-61.30604808932503,-33.84046128014441],[-61.306143888278996,-33.840801374736316],[-61.30598219492593,-33.841088001849094],[-61.30757239940571,-33.841967156609876],[-61.30920555104759,-33.84277500140921],[-61.3115751786311,-33.83968838375797],[-61.3115751786311,-33.83968838375797]]]}"
How do I read a csv with syntax like above?
I am doing:
import pandas as pd
df = pd.read_csv('file.csv')
However, read_csv gets confused with the , within "{"type":"Polygon","coordinates": I want it to ignore the , within the quotes.
Your csv file contains a MultiIndex, which is causing your read and split issues.
I have tried multiple methods to read your file correctly. The best method that I have found so far is using the Python engine with an advanced separator in the read_csv function.
import pandas as pd
# these are for viewing the output
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 120)
# The separator matches the format of the string that you provided.
# I'm sure that it can be modified to be more efficient.
df = pd.read_csv('test.csv', skiprows=1, sep='(\d{1,2}.\d{1,8}),(\d{1,2}.\d{1,8}),("{"type":.*)',engine="python")
# some cleanup
df = df.drop(df.columns[0], axis=1)
# I had to save the processed file
df.to_csv('test_01.csv')
# read in the new file
df = pd.read_csv('test_01.csv', header=None, index_col=0)
print(df.to_string(index=False))
11.54 0.0 "{"type":"Polygon","coordinates":[[[-61.3115751786311,-33.83968838375797],[-61.29737019968823,-33.83207774370677],[-61.29443049860791,-33.83592770721248],[-61.29241347742871,-33.83489393774538],[-61.28994584513501,-33.83806650089736],[-61.292499308117186,-33.83938539699006],[-61.28958106470898,-33.8431993873636],[-61.29307859612687,-33.84495487100211],[-61.295256567865046,-33.846135537383866],[-61.296388484054326,-33.84676149889543],[-61.296747927196776,-33.84651421268175],[-61.297498943449426,-33.84670133707654],[-61.297992472179686,-33.847120134589964],[-61.299741220055196,-33.84901812154847],[-61.3012164422457,-33.85018089588664],[-61.3015892874819,-33.850566250375365],[-61.30284190607861,-33.85079121660985],[-61.30496105223345,-33.848193766906206],[-61.306084952130036,-33.84682375029292],[-61.30707604410075,-33.845532812572294],[-61.30672627175046,-33.84527169005647],[-61.306290670206494,-33.845188781884744],[-61.304604048903514,-33.847304098561025],[-61.30309763921784,-33.84654473836309],[-61.30013213880613,-33.84478736144466],[-61.30110629620797,-33.8431690707163],[-61.303046037678854,-33.844170576767105],[-61.30433047221653,-33.84266156764314],[-61.30484242472771,-33.842899106713375],[-61.30696068650711,-33.844104878773436],[-61.306418212892446,-33.84505221083753],[-61.307163201216696,-33.845464893960255],[-61.30760172622554,-33.84490909256552],[-61.307932962646014,-33.844513681420494],[-61.309176116985405,-33.84280834206188],[-61.30596211112515,-33.841126948963954],[-61.3056475423994,-33.841449215098756],[-61.30526859890979,-33.841557611902374],[-61.30483601097522,-33.84149669494795],[-61.30448925534122,-33.84120408616046],[-61.30410688411086,-33.840609953572034],[-61.30400151682434,-33.839925243738094],[-61.30240379835875,-33.83889223688216],[-61.30188418287129,-33.838444480832685],[-61.301130848179525,-33.83943255499186],[-61.30078636095504,-33.83996223583909],[-61.30059265818967,-33.84016469670277],[-61.30048478527255,-33.840438447848506],[-61.300252198180424,-33.84026774340676],[-61.29876711207748,-33.839489883020924],[-61.29799408649143,-33.840597902688785],[-61.297669258508,-33.84103160870988],[-61.297566592962134,-33.84112444052047],[-61.29748538503245,-33.841083604060834],[-61.297140578061956,-33.84134946797752],[-61.29709617977233,-33.84160419097128],[-61.297170540239335,-33.84168254110631],[-61.297341460506956,-33.84179653572337],[-61.297243418161194,-33.84197105818567],[-61.29699517169225,-33.84200300239938],[-61.29680176950715,-33.84179064473802],[-61.29691703393983,-33.8416707218475],[-61.297053755769845,-33.841604265738546],[-61.29707920124143,-33.84154875978832],[-61.29709391784669,-33.84147543150246],[-61.29711262215961,-33.84133768608576],[-61.296951411710374,-33.84119216012805],[-61.297262269660294,-33.84089514360839],[-61.297626491077864,-33.84051497848962],[-61.29865532547658,-33.83935363544152],[-61.30027710358755,-33.84011486145675],[-61.30046658230606,-33.83996490243917],[-61.30063460268783,-33.83979712050095],[-61.300992098665965,-33.8393813535522],[-61.301799802937595,-33.83832425565103],[-61.30135527704997,-33.837671541923235],[-61.30082030025984,-33.83731962483044],[-61.299512855628244,-33.83689640801839],[-61.29879550338594,-33.8363083288346],[-61.29831419490918,-33.835559835856905],[-61.298360098160686,-33.83408067231082],[-61.29976541168753,-33.83467181800819],[-61.30104200723692,-33.83586895614681],[-61.30133434017162,-33.83606352507277],[-61.30153415160492,-33.836339043812224],[-61.30164813329583,-33.83657891551336],[-61.30124575062752,-33.83743146168004],[-61.30195917352424,-33.83831965157767],[-61.30196183786503,-33.83843401993221],[-61.30250094586367,-33.83890484694379],[-61.304002690127376,-33.83984352469762],[-61.30473149692381,-33.8397514189025],[-61.3054487998093,-33.839941491549894],[-61.30582354557356,-33.84016574092716],[-61.30604808932503,-33.84046128014441],[-61.306143888278996,-33.840801374736316],[-61.30598219492593,-33.841088001849094],[-61.30757239940571,-33.841967156609876],[-61.30920555104759,-33.84277500140921],[-61.3115751786311,-33.83968838375797],[-61.3115751786311,-33.83968838375797]]]}"
Try this:
pd.read_csv('file.csv',quotechar='"',skipinitialspace=True)

Converting date and time format when importing csv file in Python

I haven't been able to find a solution in similar questions yet so I'll have to give it a go here.
I am importing a csv file looking like this in notepad:
",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","
What I want when I save it as a new csv, is to reformat the date/time and the corresponding value to this (required by analysis software): The semicolon as separator and in the end is important, and I don't really need a header.
2019-07-19 22:31:29;0;
2019-07-19 22:32:01;1;
This is what it looks like in Python:
Item1 = pd.read_csv(r'.\Datafiles\ItemName.csv')
Item1
#Output:
# ,"ItemName"
# 0 Time,"Raw Values"
# 1 7/19/2019 10:31:29 AM," 0",
# 2 7/19/2019 10:32:01 AM," 1",
valve_G1.dtypes
# ,"ItemName" object
# dtype: object
I have tried using datetime without any luck but there might be something fishy with the datatypes that I am not aware of.
What you want in principle is read to DataFrame, convert datetime column and export df to csv again. I think you will need to get rid of the quote-chars to get the import correct. You can do so by reading the file content to a string, replace the '"', and feed that string to pandas.read_csv. EX:
import os
from io import StringIO
import pandas as pd
# this is just to give an example:
s='''",""ItemName"""
"Time,""Raw Values"""
"7/19/2019 10:31:29 PM,"" 0"","
"7/19/2019 10:32:01 PM,"" 1"","'''
f = StringIO(s)
# in your script, make f a file pointer instead, e.g.
# with open('path_to_input.csv', 'r') as f:
# now get rid of the "
csvcontent = ''
for row in f:
csvcontent += row.replace('"', '')
# read to DataFrame
df = pd.read_csv(StringIO(csvcontent), sep=',', skiprows=1, index_col=False)
df['Time'] = pd.to_datetime(df['Time'])
# save cleaned output as ;-separated csv
dst = 'path_where_to_save.csv'
df.to_csv(dst, index=False, sep=';', line_terminator=';'+os.linesep)

pandas: Split a column on delimiter, and get unique values

I am translating some code from R to python to improve performance, but I am not very familiar with the pandas library.
I have a CSV file that looks like this:
O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739
I would like to split the second column on a delimiter (here, a space), and get the unique values in this column. In this case, the code should return [GO:0005737, GO:0015630, GO:0005654 GO:0005794, GO:0005739].
In R, I would do this using the following code:
df <- read.csv("data.csv")
unique <- unique(unlist(strsplit(df[,2], " ")))
In python, I have the following code using pandas:
df = pd.read_csv("data.csv")
split = df.iloc[:, 1].str.split(' ')
unique = pd.unique(split)
But this produces the following error:
TypeError: unhashable type: 'list'
How can I get the unique values in a column of a CSV file after splitting on a delimiter in python?
setup
from io import StringIO
import pandas as pd
txt = """O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739"""
s = pd.read_csv(StringIO(txt), header=None, squeeze=True, index_col=0)
solution
pd.unique(s.str.split(expand=True).stack())
array(['GO:0005737', 'GO:0015630', 'GO:0005654', 'GO:0005794', 'GO:0005739'], dtype=object)

pandas create data frame, floats are objects, how to convert?

I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
i import it:
with open('file.txt', 'r') as fo:
notes = next(fo)
headers,*raw_data = [row.strip('\r\n').split('\t') for row in fo] # get column headers and data
names = [row[0] for row in raw_data] # extract first row (variables)
data= np.array([row[1:] for row in raw_data],dtype=float) # get rid of first row
if i then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I could get the sample names back as column by s=s.reset_index().
if i do
s = pd.DataFrame(raw_data,columns=headers)
the floats are objects and i cannot perform standard calculations.
How would you make the data frame ? Is it better to import the data as dict ?
BTW i am using python 3.3
You can parse your data file directly into data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
value1 value2
sample
A 0.12120 0.2354
B 0.23493 1.3442
[2 rows x 2 columns]
Then, you can do your computations.
To parse such a file, one should use pandas read_csv function.
Below is a minimal example showing the use of read_csv with parameter delim_whitespace set to True
import pandas as pd
from StringIO import StringIO # Python2 or
from io import StringIO # Python3
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), delim_whitespace=True)

Load data from txt with pandas

I am loading a txt file containig a mix of float and string data. I want to store them in an array where I can access each element. Now I am just doing
import pandas as pd
data = pd.read_csv('output_list.txt', header = None)
print data
This is the structure of the input file: 1 0 2000.0 70.2836942112 1347.28369421 /file_address.txt.
Now the data are imported as a unique column. How can I divide it, so to store different elements separately (so I can call data[i,j])? And how can I define a header?
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]
Add sep=" " in your code, leaving a blank space between the quotes. So pandas can detect spaces between values and sort in columns. Data columns is for naming your columns.
I'd like to add to the above answers, you could directly use
df = pd.read_fwf('output_list.txt')
fwf stands for fixed width formatted lines.
You can do as:
import pandas as pd
df = pd.read_csv('file_location\filename.txt', delimiter = "\t")
(like, df = pd.read_csv('F:\Desktop\ds\text.txt', delimiter = "\t")
#Pietrovismara's solution is correct but I'd just like to add: rather than having a separate line to add column names, it's possible to do this from pd.read_csv.
df = pd.read_csv('output_list.txt', sep=" ", header=None, names=["a", "b", "c"])
you can use this
import pandas as pd
dataset=pd.read_csv("filepath.txt",delimiter="\t")
If you don't have an index assigned to the data and you are not sure what the spacing is, you can use to let pandas assign an index and look for multiple spaces.
df = pd.read_csv('filename.txt', delimiter= '\s+', index_col=False)
Based on the latest changes in pandas, you can use, read_csv , read_table is deprecated:
import pandas as pd
pd.read_csv("file.txt", sep = "\t")
If you want to load the txt file with specified column name, you can use the code below. It worked for me.
import pandas as pd
data = pd.read_csv('file_name.txt', sep = "\t", names = ['column1_name','column2_name', 'column3_name'])
You can import the text file using the read_table command as so:
import pandas as pd
df=pd.read_table('output_list.txt',header=None)
Preprocessing will need to be done after loading
I usually take a look at the data first or just try to import it and do data.head(), if you see that the columns are separated with \t then you should specify sep="\t" otherwise, sep = " ".
import pandas as pd
data = pd.read_csv('data.txt', sep=" ", header=None)
You can use it which is most helpful.
df = pd.read_csv(('data.txt'), sep="\t", skiprows=[0,1], names=['FromNode','ToNode'])

Categories