Creating a dataframe from a csv file in pandas: column issue - python

I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
[screenshot of the messy text file]
I can read it in as a csv file, which looks a bit nicer, using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints the data aligned, but the issue is that the output is [640 rows x 1 column]; I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
Use the delim_whitespace=True parameter of read_csv (see the pandas read_csv docs):

df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)

Your input file is actually not in CSV format.
Since you provided only a .png picture, it is not even clear whether this file is divided into rows at all.
If it is not, you will have to start by "cutting" the content into individual lines and then read the result of this cutting.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
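A sketch of that two-step approach, assuming the file turns out to be one unbroken run of text with records of a known fixed width (the 80-character width is a pure guess):
import io
import pandas as pd

# Hypothetical: cut the raw text into 80-character records by hand.
with open('phx_30kV_indepth_0_0_outfile.txt') as f:
    raw = f.read()
lines = [raw[i:i + 80] for i in range(0, len(raw), 80)]

# Re-parse the cut lines as a whitespace-delimited table.
df = pd.read_csv(io.StringIO('\n'.join(lines)), delim_whitespace=True)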

Related

Saving columns from csv

I am trying to write code that reads a csv file and saves each column as a specific variable. I am having difficulty because the header is 7 lines long (something I can control, but would like to just ignore if I can handle it in code), and my data is full of important decimal places, so it cannot be changed to int (or maybe string?). I've also tried saving each column by its placement in the file, but am struggling to get that to run. Any ideas?
The image shows my current code, slimmed down to the important parts, with the data that prints to my console circled.
To save each column as a specific variable:
import pandas as pd

df = pd.read_csv('file.csv')
x_col = df['X']
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
If what you are looking for is how to iterate through the columns, no matter how many there are (which is what I think you are asking), then this code should do the trick:
import pandas as pd

data = pd.read_csv('optitest.csv', skiprows=6)
for column in data.columns:
    # You will need to define what this save() method is.
    # Just placing it here as an example.
    save(data[column])
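For instance, a minimal save() could write each column to its own file (the per-column file naming is just an assumption):
def save(column_series):
    # Hypothetical save(): write one column to its own CSV file,
    # keeping the float values and their decimal places intact.
    column_series.to_csv('%s.csv' % column_series.name, index=False)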
The line about formatting your data as a number or a string was a little vague, but if it's decimal data, then you need to use float. See #9637665.

Pandas read_csv end reading at first linebreak

I am trying to read a csv file with some garbage at the top, but also garbage at the bottom of the interesting data. I need to read multiple files and the length of the interesting data varies. Is there a way to let the pd.read_csv command know that the dataframe ends at the first linebreak?
Example data (screenshot from Excel):
I read the file with:
dataframe = pd.read_csv(file, skiprows=45)
Which nicely gives me a dataframe with 10 columns with the headers on line 46 (see image). However, it continues further than the #GARBAGE DATA row.
Important note: Neither the length of the data nor the length of the footer is of equal length in the different files I want to read.
Two ways you could implement this:
1) Use the skipfooter parameter of read_csv; it tells the function the number of lines at the bottom of the file to skip. Note that skipfooter is only supported by the Python parser engine:
pd.read_csv("in.csv", skiprows=45, skipfooter=2, engine="python")
2) Read the file as it is and later use the dropna function; this should drop the garbage values.
df.dropna(inplace=True)
After using this command:
dataframe = pd.read_csv(file, skiprows=45)
You can use this command:
dataframe = dataframe.dropna(how='any')
This deletes a row if any value in it is empty, and hence removes all of the remaining garbage rows.
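If you literally want the read to stop at the first blank line instead of dropping rows afterwards, here is a sketch under the question's assumptions (45 garbage lines at the top, and a blank line separating the data from the footer; file is the path variable from above):
import io
import pandas as pd

with open(file) as f:
    lines = f.readlines()[45:]   # drop the 45 garbage header lines

body = []
for line in lines:
    if not line.strip():         # first blank line: the data has ended
        break
    body.append(line)

dataframe = pd.read_csv(io.StringIO(''.join(body)))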

Fixed Width File manipulation in Pandas

I have a fixed-width file with the following format:
5678223313570888271712000000024XAXX0101010006461801325345088800.0784001501.25abc#yahoo.com
5678223324686600271712000000070XAXX0101010006461801325390998280.0784001501.25abcde.12345#gmail.com 5678123422992299
Here's what I tried:
import pandas as pd

ColSpecs = [(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143)]
# header=None: the file itself has no header row
df = pd.read_fwf("~/filename.txt", colspecs=ColSpecs, header=None)
Now this surely helps me convert cleanly into Pandas format. However, the blanks (fixed white spaces) get trimmed off. For example, the email field (#8) is fixed at 50 characters; the padding gets truncated as soon as the data is imported into the Pandas dataframe.
For the data manipulation, I am creating 3 new fields that are extracted from the values of the previously imported fields.
Final Output file structure:
[(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143),(143,153),(153,163),(164,165)]
Since I haven't found any to_fwf method on dataframes, or any other Pandas alternative for writing a flat file (keeping the original lengths intact), I would really appreciate it if anyone has a better solution.
P.S.: I read that awk/sed in Unix works better, but would still like to know how to do this in Python.
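For the writing side, since pandas has no to_fwf, one possibility is to pad every value back to its fixed width by hand. A sketch, assuming df holds the nine fields read above (the output filename is hypothetical):
# Widths of the nine fields, derived from ColSpecs above.
widths = [16, 15, 13, 18, 8, 3, 4, 50, 16]

def to_fwf(df, path, widths):
    # Pad (or trim) each value to its original fixed width, restoring
    # the spaces that read_fwf stripped on import.
    with open(path, 'w') as f:
        for row in df.itertuples(index=False):
            f.write(''.join(str(v).ljust(w)[:w] for v, w in zip(row, widths)) + '\n')

to_fwf(df, 'outfile.txt', widths)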

make custom spreadsheets with python

I have a pandas data frame with two columns:
year experience and salary
I want to save a csv file with these two columns and also have some stats at the head of the file as in the image:
Is there any option to handle this with pandas or some other library, or do I have to write a script that builds the file line by line, adding the commas between fields?
Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The RFC for CSV (RFC 4180) states that "Each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths to go from here:
i. Create two separate data frames and map them to csv files (to be super precise, three), one with your records and one with the additional values.
ii. Write your data frame to csv first, then open that file and insert your additional values at the top (see the sketch below).
iii. If your goal is an import into Excel, #gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
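A minimal sketch of option ii, using the average and max mentioned above (the column names and sample values are made up):
import pandas as pd

df = pd.DataFrame({'year experience': [1, 3, 5],
                   'salary': [50000.0, 65000.0, 80000.0]})

# Write the stats lines first, then append the records underneath.
with open('report.csv', 'w') as f:
    f.write('average,%s\n' % df['salary'].mean())
    f.write('max,%s\n' % df['salary'].max())
df.to_csv('report.csv', mode='a', index=False)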
You can read the file back in as two separate parts (stats and csv).
Reading the stats (assuming pandas is imported and file_path points at the file):
number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading the remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')
Take a look at xlsxwriter. Perhaps it's what you are looking for.

Producing pandas DataFrame from table in text file

I have some data in a text file which looks like this:
(v14).K TaskList[Parameter Estimation].(Problem)Parameter Estimation.Best Value
5.00885e-007 3.0914e+007
5.75366e-007 2.99467e+007
6.60922e-007 2.99199e+007
I'm trying to get this data into a pandas dataframe. The code I've written below partially works but has formatting issues:
import pandas

def parse_PE_results(results_file):
    with open(results_file) as f:
        data = f.readlines()
    parameter_value = []
    best_value = []
    for i in data:
        split = i.split('\t')
        parameter_value.append(split[0])
        best_value.append(split[1].rstrip())
    pv = pandas.Series(parameter_value, name=parameter_value[0])
    bv = pandas.Series(best_value, name=best_value[0])
    df = pandas.DataFrame({parameter_value[0]: pv, best_value[0]: bv})
    return df
I get the feeling that there must be an easier, more 'pythonic' way of building a data frame from text files. Would anybody happen to know what that is?
Use pandas.read_csv. The entire parse_PE_results function can be replaced with
df = pd.read_csv(results_file, delimiter='\t')
You'll also enjoy better performance by using read_csv instead of calling f.readlines() and looping through the file line by line.
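As a quick check of the result (the filename is hypothetical):
import pandas as pd

df = pd.read_csv('PE_results.txt', delimiter='\t')
print(df.columns)   # the two header names from the first line
print(df.dtypes)    # both columns parse as float64, not strings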
