Splitting fields from a CSV file using pyspark - python

I'm having issues splitting fields from a CSV file with PySpark. I'm trying to output the country and points of each wine (just to prove the parsing is working), but I get an error.
This is how the CSV file looks:
,country,description,designation,points,price,province,region_1,region_2,variety,winery
20,US,"Heitz has made this stellar rosé from the rare Grignolino grape since 1961. Ruby grapefruit-red, it's sultry with strawberry, watermelon, orange zest and salty spice flavor, highlighted with vibrant floral aromas.",Grignolino,95,24.0,California,Napa Valley,Napa,Rosé,Heitz
and here is my code
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SQLProject")
sc = SparkContext(conf = conf)

def parseLine(line):
    fields = line.split(',')
    country = fields[1]
    points = fields[4]
    return country, points

lines = sc.textFile("file:///Users/luisguillermo/IE/Spark/Final Project/wine-reviews/winemag-data-130k-v2.csv")
rdd = lines.map(parseLine)
results = rdd.collect()
for result in results:
    print(result)
And get this error:
File "/Users/luisguillermo/IE/Spark/Final Project/wine-reviews/country_and_points.py", line 10, in parseLine
points = fields[4]
IndexError: list index out of range
It appears that the program gets confused as there are commas in the description. Any ideas on how to fix this?

I would recommend using Spark's built-in CSV data source, as it provides many options, including a quote option that handles delimiters inside columns. Of course, the column containing the delimiter has to be quoted with some character.
quote
When a column contains the delimiter that is used to split the columns, use the quote option to specify the quote character; by default it is " and delimiters inside quotes are ignored, but using this option you can set any character.
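For example, a minimal sketch applied to the file from the question (the SparkSession setup, header, and inferSchema are my additions; quote='"' is already the default):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("SQLProject").getOrCreate()

# header=True takes column names from the first row;
# the default quote='"' keeps commas inside quoted descriptions together
df = spark.read.csv(
    "file:///Users/luisguillermo/IE/Spark/Final Project/wine-reviews/winemag-data-130k-v2.csv",
    header=True,
    quote='"',
    inferSchema=True,
)
df.select("country", "points").show()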
If you want to read about the other options Spark CSV provides, along with examples, I would suggest reading the following articles.
spark-read-csv-file-into-dataframe
read-csv
Happy Learning !!

See this code:
df = spark.read \
    .csv('data.csv')
df.printSchema()
df.show()
The resulting df is a DataFrame with one column per field in the CSV.
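One caveat (my addition, not part of the original answer): read with no options, the columns come back auto-named _c0, _c1, and so on; passing header=True picks up the real names from the first row:
# header=True uses the first row as column names;
# inferSchema=True guesses column types instead of leaving everything as strings
df = spark.read.csv('data.csv', header=True, inferSchema=True)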
See the Spark documentation for more advanced features.

Related

Ignore delimiters between parentheses when reading CSV

I am reading in a csv file delimited by | but some of the data includes extra | characters. When this occurs it appears to be only between two parentheses (example below). I want to be able to read in the data into a dataframe without the columns being messed up (or failing) due to these extra | characters.
I've been trying to find a way to either
set the pandas read csv delimiter to ignore delimiters between parentheses ()
or
parse over the csv file before loading it to a dataframe and remove any | characters between parentheses ()
I haven't had any luck so far. This is sample data that messes up when I attempt to pull it into a dataframe:
1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email
I am trying to ignore the | characters between the parentheses () from (ABC(111) to ABC(111),1)
This pattern occurs repeatedly throughout the data, so I can't address each occurrence by hand; I am trying to handle it programmatically.
This person seems to be attempting something similar, but their solution did not work for me (when changing it to |).
Depending on the specifics of your input file you might succeed by applying a regular expression with this strategy:
Replace the sub string containing the brackets with a dummy sub string.
Read the csv.
Re-replace the dummy sub string with the original.
A dirty proof of concept looks as follows:
import re
import pandas as pd
test_line = """1|1|1|1|1|1|1|1||1||1||||2022-01-03 20:14:51|1||1|1|1%' AND 1111=DBMS_PIPE.RECEIVE_MESSAGE(ABC(111)||ABC(111)||ABC(111)||ABC(111),1) AND '%'='||website.com|192.168.1.1|Touch Email"""
# 1. step
dummy_string = '--REPLACED--'
pattern_brackets = re.compile('(\\(.*\\))')
pattern_replace = re.compile(dummy_string)
list_replaced = pattern_brackets.findall(test_line)
csvString = re.sub(pattern_brackets, dummy_string, test_line)
# 2. step
# Now I have to fake your csv
from io import StringIO
csvStringIO = StringIO(csvString)
df = pd.read_csv(csvStringIO, sep="|", header=None)
# 3. step - not the nicest way; just for illustration
# The critical column is the 20th.
new_col = pd.Series(
    [re.sub(pattern_replace, list_replaced[nRow], cell)
     for nRow, cell in enumerate(df.iloc[:, 20])])
df.loc[:, 20] = new_col
This works for the line given. Most probably you will have to adapt the recipe to the content of your input file. I hope you find your way from here.

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: dataframe from a csv file]
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because of that, the pandas dataframe has a single column, value, which holds entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
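To see this concretely (my own illustration; the file name is an assumption):
import pandas as pd

df = pd.read_csv('test.csv')
print(df.columns.tolist())  # ['value'] - a single column
print(df['value'].iloc[0])  # the whole JSON string, not separate fields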
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
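From there (my addition, not part of the original answer), the two lists can be combined into a dataframe:
import pandas as pd

df = pd.DataFrame({'id': id_vals, 'currency': currency})
print(df['id'])  # 1234, 5678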
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
     ID Currency
0  1234      USD
1  5678      EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json

# itertuples() yields one namedtuple per row; .value is the 'value' column
# (iteritems() iterates over columns, not rows, so it won't work here)
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR
    print(eval(row.value)["id"])
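Alternatively (my addition, not part of the original answer), the whole column can be expanded at once:
import json
import pandas as pd

# parse each JSON string, then spread the keys into columns
parsed = pd.json_normalize(df['value'].apply(json.loads).tolist())
print(parsed['id'])  # 1234, 5678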

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are characters that appear to be thrown in; long lines of spaces, for example, appear to separate the headers from the rest of the data points in the file. (An example file is located at https://www.chicagofed.org/applications/bhc/bhc-home; my example references Q3 1999.)
Issues:
1) Too many headers to manually create them and I need to do this for many files that may have new headers as we move forward or backwards throughout the time series
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) the unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below is simply trying to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd

def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried with the file you linked and one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = df.append(df_more_actual)
print(new_df.shape)
The final dataframe has the rows of both CSVs, and the union of their columns.
You can do this for the CSV of each quarter, appending as you go, so that in the end you have all of the rows and the union of the columns; a sketch of that loop follows.
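A hedged sketch of that loop (the bhcf + YYMM + .txt naming is assumed from the question, and pd.concat stands in for repeated append calls):
import pandas as pd

# assumed naming from the question: bhcf + YYMM + .txt
quarters = ['9903', '9906', '9909', '9912']
frames = []
for q in quarters:
    df_q = pd.read_csv('bhcf' + q + '.txt', sep='^')
    df_q.columns = df_q.columns.str.upper()
    frames.append(df_q)

# concat keeps the union of columns, filling missing fields with NaN
all_quarters = pd.concat(frames, ignore_index=True, sort=False)
print(all_quarters.shape)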

Parse values in text file

I've got a .txt file that looks like this:
id nm lat lon countryCode
5555555 London 55.876456 99.546231 UK
I need to parse each field and add them to a SQLite database. So far I've managed to transfer into my db the id, name and countryCode columns, but I'm struggling to find a solution to parse the lat and lon of each record individually.
I tried with regex, but no luck. I also thought about writing a parser that checks whether the last non-whitespace character is a letter, to determine that a string is lat and not lon, but I have no idea how to implement that correctly. Can I solve this with regex, or should I write a custom parser? If so, how?
You can do that with pandas like this:
import pandas as pd
import sqlite3
con = sqlite3.connect('path/new.db')
con.text_factory = str
df = pd.read_csv('file_path', sep='\t')
df.to_sql('table_01', con)
If there are bad lines and you can afford to skip them then use this:
df = pd.read_csv('file_path', sep='\t', error_bad_lines=False)
Read more in the pandas read_csv documentation.
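One caveat (my addition): the sample shown is separated by runs of spaces rather than tabs; if that is what the real file contains, a regex separator may work better:
# \s+ treats any run of whitespace as a single delimiter
df = pd.read_csv('file_path', sep=r'\s+')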
Looking at the text file, it looks like it's always the same format for each line. As such, why not just split like this:
for line in lines:
    id, nm, lat, lon, code = line.split()
    # Insert into SQLite db
With split() you don't have to worry about how much whitespace there is between each token of the string.
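Filling in the elided insert step, a minimal sketch (the table name, column types, and paths are my own assumptions):
import sqlite3

con = sqlite3.connect('path/new.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS cities '
            '(id INTEGER, nm TEXT, lat REAL, lon REAL, countryCode TEXT)')

with open('file_path') as f:
    next(f)  # skip the header row
    for line in f:
        id_, nm, lat, lon, code = line.split()
        cur.execute('INSERT INTO cities VALUES (?, ?, ?, ?, ?)',
                    (int(id_), nm, float(lat), float(lon), code))

con.commit()
con.close()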
Using str.split:
txt = '5555555 London 55.876456 99.546231 UK'
(id, nm, lat, lon, countryCode) = txt.split()

How to retrieve only the tabular data from a big text file using python

I have a big text file that also contains data in tabular form.
I want to retrieve the data from the tabular form, starting with the header and going until the end of the table in the file (I don't know where the end may be; it might come after 20 or 30 lines, and the header and end may vary between files).
I should ignore all the other text in the file; I just need that tabular form, put into a separate file.
Example:
brand and dominant market presence in the top life science clusters,
including Greater Boston, the Bay Area, Shoojhriwp,
------
Header
Row1 val1 val2 val3
ROw2 val1 -- ---
row "" "" ""
"" "" "" """
""
(May be end of the table)
again the text........
.........................
,,,,,,,,,,,,,,,,,,,,,,,
So, how can I retrieve the data from the table(same tabular format as it is in the text file) and put it in a file.
I have tried something and it is not working
Easiest way would be something like this (with Pandas installed):
from io import StringIO
import pandas as pd

f = open('path/to/file.txt', 'r')
fileobj = StringIO(f.read())
dataframe = pd.read_csv(fileobj, header=0,
                        sep='\t', engine="python")
Unless I see a more detailed example with proper formatting, it's hard to write code for it.
So one thing you can do is read the file line by line. Once you reach the header of a table (I'm assuming you know beforehand what the table header looks like), you can use the split method on it, separating on spaces (or maybe commas), and record this data. Assuming the data in the table follows a fixed structure where every row has the same number of columns, you can stop recording data from the file once the number of results from split changes; a sketch of this follows the example below.
Here is how to use the str.split() method. Let's say you have a string:
line = "col1 col2 col3"
column_list = line.split()
column_list is now ["col1", "col2", "col3"]
Since in this example there are only 3 elements in the list, what you can do is check the size of the list before you store the values from each line of the table. Once you have a list whose size differs from the previous rows', you know you've reached the end of the table in the file.
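A minimal sketch of that approach (assuming, as above, that the header line is known in advance):
def extract_table(path, header_prefix):
    rows = []
    in_table = False
    n_cols = None
    with open(path) as f:
        for line in f:
            if not in_table:
                # keep scanning until we hit the known table header
                if line.startswith(header_prefix):
                    in_table = True
                continue
            parts = line.split()
            if n_cols is None:
                n_cols = len(parts)  # column count of the first data row
            if len(parts) != n_cols:
                break  # column count changed: we've run past the table
            rows.append(parts)
    return rows

# e.g. extract_table('path/to/file.txt', 'Header')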
