I have a dataframe x with shape (2000, 3000). I would like to export it to CSV to use in R. I tried this code:
x.to_csv("ab.csv", sep='\t')
However, when I open it in R with:
data = read.csv(".data/ab.csv")
The size of data is (2000, 1), because the CSV file is not being split into 3000 columns. Is there any solution to keep the same shape after exporting?
By using the parameter sep='\t' you have written a "CSV" which uses a tab to separate fields instead of a comma. You could either remove the parameter and write a normal CSV, or pass sep="\t" to read.csv in R. If there's no reason to use tabs then I would suggest the former option.
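For example, a minimal sketch of the first option, assuming x is the pandas DataFrame from the question (index=False is optional, but keeps pandas' row index from showing up as an extra column in R):

# Write a plain comma-separated file; R's read.csv splits on commas by default.
x.to_csv("ab.csv", index=False)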
Try reading the csv file like this:

data = read.csv(".data/ab.csv", sep="\t")

Your csv uses \t to separate each value; with the sep parameter you specify which separator to use when opening it.
I initially tried to read a 4 GB csv file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and run operations (aggregations, group by) on it. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record of the csv file as the header (the column names, to be precise), and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with this. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
Reading the pandas documentation: with header='infer' (which is the default), the first row of the file is used as the header unless you pass column names yourself. Pass header=None to tell the reader that the file has no header row, and supply the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex and the convert and chunk_size arguments.
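For instance, a minimal sketch for this case (the column names here are placeholders; use ones that match your file):

import vaex

# header=None: treat the first row as data, not as column names.
# names=...: supply your own column names (placeholders here).
df = vaex.from_csv('Wager-Win_April-Jul.csv',
                   header=None,
                   names=['wager_id', 'win_amount', 'count'],
                   convert=True,
                   chunk_size=5000000)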
I have some data saved in ".txt" files. This is how they are stored:

I used the code below to read the data and save it in a data frame object (needless to say, I'm using Python's pandas library):
new_df = pd.read_csv(location, sep='\t', lineterminator='\n', names=None)
The problem is that when I get the shape of my data frame with new_df.shape I end up with (123, 1). It does not recognize that the data has 4 columns. How can I fix this?
It seems you don't have tabs but spaces - use sep="\s+"
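A minimal sketch, assuming the fields are separated by runs of spaces (location is the path variable from the question):

import pandas as pd

# r'\s+' matches one or more whitespace characters between fields.
new_df = pd.read_csv(location, sep=r'\s+')
print(new_df.shape)  # should now report 4 columns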
From your screenshot, your data appear to be in fixed width format.
Try to use pandas.read_fwf to read your data file:
pd.read_fwf(location)
You may pass the colspecs=... argument to tell it in which character positions each column sits, but the routine is smart enough to figure this out automagically.
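If the automatic inference guesses wrong, here is a sketch with explicit column boundaries (the positions are made up for illustration; adjust them to your file's layout):

import pandas as pd

# (start, end) character ranges for each of the four columns.
new_df = pd.read_fwf(location, colspecs=[(0, 8), (8, 16), (16, 24), (24, 32)])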
I have a messy text file that I need to sort into columns in a dataframe so I
can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints the data aligned, but the issue is that the output is [640 rows x 1 column], and I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
Try passing delim_whitespace=True to pandas.read_csv (see the docs for details):

df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.

Since you provided only a .png picture, it is not even clear whether this file is divided into rows or not.

If it is not, you have to start by "cutting" the content into individual lines and then reading the result of this cutting.

I think this is the necessary first step before you can use either read_csv or read_table (of course, with delim_whitespace=True).
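As a sketch of that first step, assuming (hypothetically) that every record is a fixed number of characters long:

import io
import pandas as pd

RECORD_WIDTH = 80  # hypothetical record length; adjust to your file

# Read the raw content, cut it into lines, then parse the result.
with open('phx_30kV_indepth_0_0_outfile.txt') as f:
    raw = f.read()
lines = [raw[i:i + RECORD_WIDTH] for i in range(0, len(raw), RECORD_WIDTH)]
df = pd.read_csv(io.StringIO('\n'.join(lines)), delim_whitespace=True)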
I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the csv file using pandas.

What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called df.
You can:
Option one: write twice:
df.write.(...).option("header", "false").csv(....)
df.limit(1).write.(...).option("header", "true").csv(....) // as far as I remember, someone had problems with saving a DataFrame without any rows -> you must write at least one data row and then manually cut it out using the normal Java or Python file API

(Note: df.take(1) returns a plain list of rows, not a DataFrame, so it has no writer; df.limit(1) keeps it a DataFrame.)
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java or Python file API.
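A sketch of that manual cut in plain Python (the paths are placeholders, and it assumes the output was coalesced into a single part file):

# Split the single Spark output part file into a header file and a rows file.
with open('/out/full/part-00000.csv') as src:  # placeholder path
    header = src.readline()
    rows = src.read()

with open('/out/header.csv', 'w') as h:
    h.write(header)
with open('/out/rows.csv', 'w') as r:
    r.write(rows)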
Data, without header:

df.to_csv("rows.csv", header=False, index=False)

Header, without data:

df_new = pd.DataFrame(data=None, columns=df_old.columns)  # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("header.csv", index=False)

(Note the two different file names - writing both to the same path would overwrite the first file, and index=False keeps the pandas index out of the output.)
I am just getting started with Pyspark and would like to save a file as a csv instead of a text file. I tried using a couple of answers I found on Stack Overflow such as
def toCSVLine(data):
    return ','.join(str(d) for d in data)
and then
rdd = lines.map(toCSVLine)
rdd.saveAsTextFile("file.csv")
It works in that I can open it in Excel, however all the information is put into column A of the spreadsheet. I would like to put each column in the rdd (for example ("ID", "rating")) into a separate column in Excel, so ID would be in column A and rating in column B. Is there a way to do this?
If you're on Spark >= 2.0 and assuming your RDD has a tabular format (which it should, given you want to save it as CSV), one way is to first create a DataFrame from the RDD and then use DataFrameWriter to export it to CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(rdd)
df.write.csv("/path/to/file.csv", sep=',', header=True)
Have a look at the pyspark.sql docs for additional options and further information.
In Excel, are you splitting the file on the ','?
In Excel, go to the Data tab and select Text to Columns under Data Tools, then select Delimited and hit Next. Then select comma as the delimiter and hit Finish.
Edit
Generally it is best practice to create a csv with a separator other than a comma if commas will appear in your data. Per your comment, if you are creating the csv yourself, just use a different separator (e.g. ';', '|', '^', or tabs). Another option, which I prefer less, is to wrap the field in question in "" like so:
field0,field1,"field,2",field3
Excel should leave what is in quotes alone and only split on commas outside of the quotes. But again this is not my preferred solution.
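If you're writing the file from Python, the standard csv module applies that quoting for you; a minimal sketch:

import csv

# QUOTE_MINIMAL (the default) wraps any field containing the delimiter
# in double quotes, exactly like the example above.
with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['field0', 'field1', 'field,2', 'field3'])
# File contents: field0,field1,"field,2",field3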
One option is to convert the RDD to a DataFrame and then save it as CSV.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is your existing SparkContext
df = sqlContext.createDataFrame(rdd, ['count', 'word'])

# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')
Please see this post I just made:
How to write the resulting RDD to a csv file in Spark python