I have a large file which is dynamically generated, a small sample of which is given below:
ID,FEES,I_CLSS
11,5555,00000110
12,5555,654321
13,5555,000030
14,5555,07640
15,5555,14550
17,5555,99070
19,5555,090090
My issue is that this file will always have a column like I_CLSS whose values start with 0s. I'd like to read the file into a Spark dataframe with the I_CLSS column as StringType.
For this, in Python I can do something like:
df = pandas.read_csv('INPUT2.csv',dtype={'I_CLSS': str})
But is there an alternative to this command in pyspark?
I understand that I can manually specify the schema of a file in PySpark, but that would be extremely difficult to do for a file whose columns are dynamically generated.
So I'd appreciate it if somebody could help me with this.
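For reference, a minimal sketch of one way to do this in PySpark (assuming the file has a header row and that reading every column as a string is acceptable): Spark only infers column types when inferSchema is enabled, so leaving it off preserves the leading zeros.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# inferSchema=False (the default) reads every column as StringType,
# so values like 00000110 keep their leading zeros
df = spark.read.csv('INPUT2.csv', header=True, inferSchema=False)
df.printSchema()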
First post here.
I am very new to programming, so sorry if this is confusing.
I made a database by collecting multiple different data online. All these data are in one xlsx file (one column per variable), which I converted to csv afterwards because my teacher only showed us how to use csv files in Python.
I installed pandas and made it read my csv file, but it seems it doesn't understand that I have multiple columns; it reads everything as one column. Thus, I can't get the info on each variable (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True) but they give the same result.
len(df.columns) = 1, which proves it doesn't see that each variable has its own column.
len(df) = 1923, which is right.
I was expecting this: https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And I get this instead: https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know, it looks pretty similar, so why doesn't it work? :((
Thanks.
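A minimal sketch of one likely fix, though this is a guess since the raw file isn't shown: Excel often exports CSV with ';' rather than ',' as the separator, which makes pandas see a single column. Letting pandas sniff the delimiter usually recovers the columns ('database.csv' is a placeholder file name):
import pandas as pd

# sep=None with the python engine auto-detects the delimiter (',' vs ';', tab, ...)
df = pd.read_csv('database.csv', sep=None, engine='python')
print(len(df.columns))  # should now be greater than 1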
I need help reading a CSV file with pandas.
I have a .csv file that recorded machine parameters, and I want to read it with pandas and analyze it. The problem is that the file is not in a proper table format: there are a lot of empty rows and columns, and the parameter values only start at the 301st line (for example).
How can I read this csv file properly?
You can use skiprows:
pd.read_csv(csv_file, skiprows=300)  # skip the first 300 lines so the data starting at line 301 is read
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
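A small extension of this answer, as a sketch ('machine_log.csv' and the row count are placeholders): after skipping the leading lines, the empty rows and columns the question mentions can be dropped too.
import pandas as pd

df = pd.read_csv('machine_log.csv', skiprows=300)
# drop rows and columns that are entirely empty
df = df.dropna(axis=0, how='all').dropna(axis=1, how='all')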
After scraping, I put the information in a dataframe and want to export it to a .csv file, but one of the three columns ("Content") comes out empty in the .csv file. This is weird since all three columns are visible in the dataframe, see screenshot.
Screenshot dataframe
Line I use to convert:
df.to_csv('filedestination.csv')
Inspecting the df returns objects:
Inspecting dataframe
Does anyone know how it is possible that the last column, "Content", does not show any data in the .csv file?
Screenshot .csv file
After suggestions, it seems that the data is available when opening the file as .txt. How is it possible that Excel does not show the data properly?
Screenshot .txt file data
What is the data type of the Content column?
If it is not a string, convert it to a string and then perform df.to_csv.
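A short sketch of that suggestion, assuming df is the dataframe from the question; the utf-8-sig encoding is an extra guess, since Excel misreads plain UTF-8 CSVs more often than text editors do:
df['Content'] = df['Content'].astype(str)
# utf-8-sig writes a BOM so Excel detects the encoding correctly
df.to_csv('filedestination.csv', index=False, encoding='utf-8-sig')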
Sometimes this happens weirdly; the view and the export differ. Try resetting the index before exporting to .csv/Excel. This always works for me.
df = df.reset_index()
then,
df.to_csv(r'file location/filename.csv')
Using pandas, I'm trying to read an Excel file that looks like the following:
another sample
I tried to read the Excel file using the regular approach by running: df = pd.read_excel('filename.xlsx', skiprows=6).
But the problem is that I don't get all the column names I need; most of them come back as Unnamed: 1, Unnamed: 2, and so on.
Is there a way to solve this and read all the columns? Or an approach where I can convert it to a JSON file?
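A minimal sketch of one possible cause, as a guess since the sheet isn't shown: the real column names may sit on a different row (or span two rows) than skiprows=6 assumes, and header= selects which row(s) become the names. The row indices below are placeholders to adapt to the actual file:
import pandas as pd

df = pd.read_excel('filename.xlsx', header=6)   # names on the 7th row (0-based index 6)
# or, if the header spans two rows:
# df = pd.read_excel('filename.xlsx', header=[5, 6])
df.to_json('filename.json', orient='records')   # the JSON conversion asked about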
I have a messy text file that I need to sort into columns in a dataframe so I can do the data analysis I need to do. Here is the messy looking file:
Messy text
I can read it in as a csv file, which looks a bit nicer, using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
And this prints out the data aligned, but the issue is that the output is [640 rows x 1 column], and I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
I also tried delim_whitespace=True (link to docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
However, when I do this, there is the issue that the
Your input file is actually not in CSV format.
As you provided only a .png picture, it is not even clear whether this file is divided into rows at all.
If it is not, you have to start by "cutting" the content into individual lines and then read the content from the output file (the result of this cutting).
I think this is the first step before you can use either read_csv or read_table (of course, with delim_whitespace=True).
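For completeness, a minimal sketch of the read_csv route, assuming the file is whitespace-delimited and already split into lines (the file name comes from the question):
import pandas as pd

# sep=r'\s+' treats any run of whitespace as one delimiter; it is equivalent
# to delim_whitespace=True, which newer pandas versions deprecate
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', sep=r'\s+')
print(df.shape)  # expect multiple columns now, not 640 rows x 1 column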