Pyspark: how to read a .csv file?

I am trying to read a .csv file that has a strange format.
This is what I am doing
df = spark.read.format('csv').option("header", "true").option("delimiter", ',').load("muyFile.csv")
df.show(5)
I do not understand why the lonlat entry of the third id is transposed. It seems that the file has two different delimiters. Your help would be much appreciated!

Your tag field probably contains a comma as a value, which is being treated as the delimiter.
Enclose your data in quotes or any other quote character (and remember to set .option('quote', ...) to that character), then read the data again. It should work.
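For example, a minimal sketch assuming the fields are wrapped in double quotes (swap in whatever quote character your file actually uses):
df = (spark.read.format('csv')
      .option("header", "true")
      .option("delimiter", ",")
      .option("quote", '"')   # the character that wraps fields containing commas
      .load("muyFile.csv"))
df.show(5)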

Related

Trying to write/read CSV file with None objects for empty cells [Python]

I'm trying to read CSV data into a pandas DataFrame so that empty cells are recognized as None values.
The delimiter is ',' and there are two consecutive delimiters wherever I need a None value. For example, the row:
12345,'abc','abc',,,12,'abc'
will be converted to a tuple with the empty cells replaced:
(12345,'abc','abc',None,None,12,'abc',)
I need this in order to insert the data into MySQL later; I'm using the cursor.execute() function with the query and the data.
I have tried loading the CSV file into a DataFrame and replacing the values, but it is not supported:
chunk = chunk.replace(np.nan, None, regex=True)
Any suggestions?
Sorry, I did not completely get the meaning of the question, but if it is in regards to CSV: why not use arbitrary placeholder values of your choice, or even empty strings, that you can then change later in the program when you write the data out or read it back?
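That said, if the goal is the MySQL insert specifically, here is a possible sketch (the file name is a placeholder, and the quotechar="'" guess comes from the single quotes in the example row): read the CSV normally and swap the NaN placeholders for None just before building the tuples.
import pandas as pd

df = pd.read_csv("data.csv", quotechar="'")  # placeholder file name

# Cast to object first so None survives; in numeric columns pandas would
# otherwise coerce None straight back to NaN.
df = df.astype(object).where(df.notna(), None)

# One tuple per row, with None where the cells were empty, ready for
# cursor.execute() / cursor.executemany().
rows = list(df.itertuples(index=False, name=None))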

Python/Pandas adding quotes to string

I'm using Python/Pandas to edit a csv file created by another program.
One of the columns contains values wrapped in double quotes:
"RGB(0,255,255)"
for example.
This is just how it is output by the program, and I need to preserve these quotes in order for the file to be read back into the program once I have edited it. Currently, when I try exporting the edited data frame to a .csv, the quotes around the values disappear, so the values look like this:
RGB(0,255,255)
I tried adding quotes manually to the values in the column before exporting, but now the .csv file has triple quotes, so it looks like this:
"""RGB(0,255,255)"""
I'm not doing anything with this particular column; I literally just need it to retain the format it had before being read into my Python script. I'm assuming there are some arguments in either my read_csv or to_csv commands, but I'm not sure where to start. Any help gratefully appreciated!
Save the DataFrame as a pickle instead.
df.to_pickle('test.pkl')
# To load the dataframe again
df = pd.read_pickle('test.pkl')
This will preserve the structure!
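If the output really has to stay a .csv, another option worth trying is to let to_csv re-add the quotes on export with csv.QUOTE_ALL. This is only a sketch: it quotes every field, which may differ from the program's original output if some columns were unquoted.
import csv
import pandas as pd

df = pd.read_csv("input.csv")   # pandas strips the CSV-level quotes on read

# QUOTE_ALL wraps every field in double quotes on the way out, so the
# column comes back as "RGB(0,255,255)".
df.to_csv("output.csv", index=False, quoting=csv.QUOTE_ALL)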

Analyze logs with Python

I have a csv file with logs.
I need to analyze it and extract the necessary information from the file.
The problem is that it contains a lot of tables with headers, and the tables don't have names.
The tables are separated from each other by empty rows.
Let's say I need to select all data from the %idle column, where CPU = all
Structure:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18
12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06
09:20:06,runq-sz,plist-sz,ldavg-1,ldavg-5,ldavg-15
09:21:06,3,1444,2.01,2.12,2.15
09:22:06,4,1444,2.15,2.14,2.15
You can use the program below to parse this csv.
result = {}
with open("log.csv", "r") as f:
    # Each blank-line-separated block is one table.
    for table in f.read().split("\n\n"):
        rows = table.split("\n")
        header = rows[0]
        for row in rows[1:]:
            # Pair every column name with its value, skipping the timestamp column.
            for i, j in zip(header.split(",")[1:], row.split(",")[1:]):
                if i in result:
                    result[i].append(j)
                else:
                    result[i] = [j]
print(result["%idle"])
Output (values of %idle)
['89.86', '80.18', '89.15', '73.06']
This assumes the column and row values of each table are in the same order and that no two tables share a column name.
One rather dumb solution would be to use an "ordinary" file reader for the original CSV. You can read everything up to an empty line as a single CSV and then parse the text you just read in memory.
Every time you see an empty line, you know to treat what follows as an entirely new CSV, so you can repeat the above procedure for it.
For example, you would have one string that contained:
09:20:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
09:21:06,all,4.98,0.00,5.10,0.00,0.00,0.00,0.06,0.00,89.86
09:21:06,0,12.88,0.00,5.62,0.03,0.00,0.02,1.27,0.00,80.18
and then parse it in memory. Once you get to the empty line after that, you would know that you need a new string containing the following:
12:08:06,CPU,%usr,%nice,%sys,%iowait,%steal,%irq,%soft,%guest,%idle
12:09:06,all,5.48,0.00,5.24,0.00,0.00,0.00,0.12,0.00,89.15
12:09:06,0,18.57,0.00,5.35,0.02,0.00,0.00,3.00,0.00,73.06
etc. - you can just keep going like this for as many tables as you have.
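A minimal sketch of that idea (reusing the log.csv name from the answer above and the CPU = all filter from the question):
import csv
from io import StringIO

idle_values = []
with open("log.csv") as f:
    blocks = f.read().split("\n\n")   # each blank-line-separated block is one table

for block in blocks:
    rows = list(csv.reader(StringIO(block.strip())))
    if not rows or "%idle" not in rows[0]:
        continue                      # skip tables without the column we want
    idle_col = rows[0].index("%idle")
    cpu_col = rows[0].index("CPU")
    for row in rows[1:]:
        if row[cpu_col] == "all":     # keep only the CPU = all rows
            idle_values.append(row[idle_col])

print(idle_values)   # ['89.86', '89.15'] for the sample data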

Pandas, remove outer quote marks from specific columns on export?

I have a specific problem: we are moving from an old system to a new one. The old database was adjusted to the new one with Pandas. However, I am facing a problem.
If the file is opened with SQL or as CSV, the values have outer quotes:
"UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)"
I need to make sure it has no outer quotes, like this:
UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)
What would be a pandas solution to do this for specific columns when exporting or saving to SQL or csv? Because right now it's stored as a string and comes back like this.
If your problem is that the old system produces .csv files with the quotes, you might just want to edit the .csv file itself.
If your problem is that pandas saves it as a string with double quotes, you can either run the same cleanup on the csv output of pandas, or you can pass the .to_csv() function the argument
quoting=csv.QUOTE_NONE
(after import csv), which turns off pandas' automatic quoting.
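A sketch of that export (the file names are hypothetical, and because the UUID_TO_BIN(...) value contains the ',' delimiter, csv requires an escape character once quoting is off):
import csv
import pandas as pd

df = pd.read_csv("old_export.csv")   # hypothetical file name

# With quoting disabled, fields containing the delimiter must be escaped,
# otherwise to_csv raises "need to escape, but no escapechar set".
df.to_csv("new_export.csv", index=False,
          quoting=csv.QUOTE_NONE, escapechar="\\")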
Try reading your data in Pandas using:
df = pd.read_csv(filename, sep=',').replace('"', '', regex=True)
This line should remove the double quotes from your data.

read_csv while skipping separator in certain columns

I have a poorly formatted json file.
I am reading it using
mydata = pd.read_csv(afilename, header=0,
                     usecols=[0, 1, 4, 5, 6, 7, 8, 9],
                     names=['ID', 'event', 'a1', 'a2',
                            'a3', 'a4', 'a5', 'a6'])
Columns 0 and 1 are read correctly.
However, the following columns of my csv file might be malformed and contain values like
'{Foo={"name":"bar",quantity:1.0,quantity_type:"baz"}, Fuu={"name":"barbar" '
which include the separator ',' that is unfortunately also used elsewhere, and this results in additional splits.
I do not know in advance how many ',' to expect, so every time I change my usecols/names list to capture the fragments of a column that got split by the extra separators, I get errors because the number of columns is not right.
Since you are reading a JSON file, you should use the read_json method instead of read_csv. This will work provided your JSON is properly formatted.
For example:
mydata = pd.read_json(afilename, orient='records')
