I am reading a .csv file into Databricks, but the result displays exactly as it appears in the .csv file, pipe characters included, with everything in a single column. That lets me work on the data, but I now want to take this raw view and read the data as a structured table.
The data, as displayed in raw format in my dataframe, is as follows:
|Name|Surname|Age|Gender|
|John|Doe|32|M
|Lisa|Doe|53|F
I would like to take the above and have my output as follows:
|Name|Surname|Age|Gender|
|----|-------|---|------|
|John|Doe|32|M
|Lisa|Doe|53|F
The following is what I do to get the initial output in my dataframe:
df = rdd_df.toDF()
df = df.withColumn('Line', df['_1'].getItem("_c0"))
df.show()
I would appreciate any help.
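A minimal sketch of two possible approaches, assuming the raw lines really are pipe-delimited (the file path below is a placeholder, and the Line column name comes from the code above):

# Option 1: let Spark split on the pipes at read time
# (assumes the first line of the file holds the column names).
parsed = (spark.read
          .option("sep", "|")
          .option("header", True)
          .csv("/path/to/file.csv"))  # placeholder path

# Option 2: split the existing single-column dataframe.
# Splitting "|John|Doe|32|M" on "|" yields ["", "John", "Doe", "32", "M"],
# so the fields start at index 1. You may also need to filter the
# header row out of df first.
from pyspark.sql.functions import split, col
parts = split(col("Line"), r"\|")
structured = df.select(
    parts.getItem(1).alias("Name"),
    parts.getItem(2).alias("Surname"),
    parts.getItem(3).alias("Age"),
    parts.getItem(4).alias("Gender"),
)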
First post here.
I am very new to programming, so sorry if this is confusing.
I made a database by collecting various data online. All of the data is in one xlsx file (one column per variable), which I converted to csv afterwards because my teacher only showed us how to use csv files in Python.
I installed pandas and had it read my csv file, but it seems it doesn't understand that I have multiple columns: it reads everything as one column. Because of that, I can't get the info on each variable (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True), but they give the same result:
len(df.columns) = 1, which proves it doesn't see that each variable has its own column
len(df) = 1923, which is right
This is what I was expecting: https://imgur.com/a/UROKtxN (a different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And this is what I get instead: https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know why it doesn't work, the two look pretty similar to me :((
Thanks.
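A common cause of this (an assumption here, since the raw file isn't shown) is that Excel saved the csv with a separator other than the comma, often a semicolon, while pd.read_csv defaults to sep=','. A minimal sketch, with "mydatabase.csv" as a placeholder name:

import pandas as pd

# If Excel exported with semicolons (common with European locales):
df = pd.read_csv("mydatabase.csv", sep=";")

# Or let pandas sniff the delimiter itself (requires the python engine):
df = pd.read_csv("mydatabase.csv", sep=None, engine="python")

print(len(df.columns))  # should now be greater than 1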
After scraping, I put the information in a dataframe and want to export it to a .csv file, but one of the three columns ("Content") comes out empty in the .csv file. This is weird since all three columns are visible in the dataframe, see screenshot.
Screenshot dataframe
Line I use to convert:
df.to_csv('filedestination.csv')
Inspecting the df shows the columns have dtype object:
Inspecting dataframe
Does anyone know how it is possible that the last column, "Content", does not show any data in the .csv file?
Screenshot .csv file
Following the suggestions, it seems the data is there when I open the file as .txt. How is it possible that Excel does not show the data properly?
Screenshot .txt file data
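One possible explanation (an assumption, since only screenshots are available): scraped text often contains embedded newlines, which can make Excel break a single logical row across several display rows even though the .csv itself is fine. A hedged workaround is to flatten the newlines before exporting:

# Hypothetical cleanup: collapse newlines/carriage returns inside Content
# so each dataframe row stays on one physical line in the .csv file.
df["Content"] = (
    df["Content"]
    .astype(str)
    .str.replace(r"[\r\n]+", " ", regex=True)
)
df.to_csv("filedestination.csv", index=False)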
What is the data type of the Content column?
If it is not a string, you can convert it to a string and then call df.to_csv.
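For example (a sketch, assuming the column currently holds non-string objects):

# Force the Content column to plain strings before exporting.
df["Content"] = df["Content"].astype(str)
df.to_csv("filedestination.csv")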
Sometimes this happens in a strange way: the view and the export differ. Try resetting the index before exporting to .csv/Excel; this always works for me. Note that reset_index returns a new dataframe, so assign the result:
df = df.reset_index()
then,
df.to_csv(r'file location/filename.csv')
I'm using this tweets dataset with PySpark in order to process it and extract some trends based on the tweets' location. But I'm having a problem when I try to create the dataframe. I'm using spark.read.options(header="True").csv("hashtag_donaldtrump.csv") to create the dataframe, but if I look at the tweets column, this is the result I get:
Do you know how can I clean the CSV file so it can be processed by Spark? Thank you in advance!
It looks like a multiline CSV. Try doing:
df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True)
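If rows still come out shifted after that, tweets often contain quote characters and commas; a common companion setting (a hedged suggestion, not verified against this particular dataset) is to configure the quote and escape options as well:

# Treat embedded double quotes as escaped quotes inside quoted fields.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("quote", '"')
      .option("escape", '"')
      .csv("hashtag_donaldtrump.csv"))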
I have a messy text file that I need to sort into columns in a dataframe so I can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, which looks a bit nicer, using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints the data out aligned, but the output is [640 rows x 1 column], and I need to separate it into multiple columns so I can manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
I have also tried delim_whitespace=True (see the read_csv docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
However, when I do this, there is still an issue with how the data is split.
Your input file is actually not in CSV format.
Since you provided only a .png picture, it is not even clear whether the file is divided into rows at all.
If it is not, you have to start by "cutting" the content into individual lines and then read the output of that cutting.
I think this is the first step before you can use either read_csv or read_table (with delim_whitespace=True, of course).
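A hedged sketch of that two-step idea (assuming the raw text can at least be split into physical lines):

import io
import pandas as pd

# Step 1: "cut" the raw content into individual, non-empty lines.
with open("phx_30kV_indepth_0_0_outfile.txt") as f:
    lines = [ln for ln in f.read().splitlines() if ln.strip()]

# Step 2: hand the cleaned lines back to pandas, splitting on whitespace runs.
data = pd.read_csv(io.StringIO("\n".join(lines)), delim_whitespace=True)
print(data.shape)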
I have a large file which is dynamically generated, a small sample of which is given below:
ID,FEES,I_CLSS
11,5555,00000110
12,5555,654321
13,5555,000030
14,5555,07640
15,5555,14550
17,5555,99070
19,5555,090090
My issue is that this file will always have a column like I_CLSS whose values start with 0s. I'd like to read the file into a Spark dataframe with the I_CLSS column as StringType.
In Python, with pandas, I can do something like:
df = pandas.read_csv('INPUT2.csv', dtype={'I_CLSS': str})
But is there an alternative to this command in pyspark?
I understand that I can manually specify the schema of a file in PySpark, but that would be extremely difficult to do for a file whose columns are dynamically generated.
So I'd appreciate it if somebody could help me with this.
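One way around it (a sketch, not necessarily the only option): as long as inferSchema is left at its default of false, Spark's CSV reader loads every column as StringType, which preserves the leading zeros in I_CLSS; you can then cast just the columns whose types you do know:

from pyspark.sql.functions import col

# With inferSchema left off (the default), every column is read as a string,
# so I_CLSS keeps its leading zeros.
df = spark.read.csv("INPUT2.csv", header=True)

# Optionally cast the columns whose types you do know:
df = df.withColumn("ID", col("ID").cast("int")) \
       .withColumn("FEES", col("FEES").cast("int"))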