printSchema having all columns in the first one - python

I loaded a text file with the CSV loader, but when I print the schema it shows a single field under the root that contains every column name, like this:
root
|-- Prscrbr_Geo_Lvl Prscrbr_Geo_Cd Prscrbr_Geo_Desc Brnd_Name
Any idea how to fix this?

Adding my comment as an answer since it seems to have solved the problem.
From the output, it looks like the file uses tab characters rather than commas as the separator between columns. To make Spark split on tabs, set the sep option when reading:
spark.read.format("csv").option("sep", "\t").load("/path/to/file")
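If it is unclear which separator a file uses, one option is to inspect its first line before handing it to Spark. The following is only a sketch under assumptions: guess_separator and read_delimited are hypothetical helper names, the candidate list is arbitrary, the file is assumed to be readable from the local filesystem, and the header option is assumed because the schema above shows column names.

```python
def guess_separator(header_line, candidates=("\t", ",", ";", "|")):
    """Return the candidate separator that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def read_delimited(spark, path):
    """Load a delimited text file, guessing the separator from its first line.

    `spark` is an existing SparkSession; `path` must be a local file here,
    because the first line is read with plain open().
    """
    with open(path, encoding="utf-8") as handle:
        sep = guess_separator(handle.readline())
    return (
        spark.read.format("csv")
        .option("sep", sep)          # same option as in the answer above
        .option("header", "true")    # assumption: first line holds column names
        .load(path)
    )


header = "Prscrbr_Geo_Lvl\tPrscrbr_Geo_Cd\tPrscrbr_Geo_Desc\tBrnd_Name"
separator = guess_separator(header)
```

Calling read_delimited(spark, path).printSchema() should then list one field per column, provided the guess is right.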


Issue with strings and dictionaries in python

I'm using the click package to read one or more input variables, which are loaded into a combined dictionary. The entries are then joined, and the combined string is appended to a base URL and sent through the requests package to retrieve some XML data.
Earlier I had an issue with one of the variables that let you search through a range, such as
[value1, value2]
Python added double quotes around it so the search function didn't operate correctly, so I used
.replace('"', '')
on the joined string before combining it with the base URL, and that seemed to fix the problem. The issue now is that an individual input containing more than one word no longer produces the same output as the actual search engine online. I have to quote such input to keep it as a single argument, but the quotes are then removed by the replace call above, and I believe that is what causes the issue.
I think that if I could access individual entries of this dictionary and remove the double quotes from only certain entries, that would get the job done. If I am overlooking something, please let me know.
Help is appreciated.
Code added below:
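import click
import requests

@click.command()
@click.option("--variable1")
@click.option("--variable2")
def main(variable1, variable2):
    # Join the option values into one query string
    query_list = [variable1, variable2]
    query = "".join(query_list)
    base_url = "abc.com...."
    # The second positional argument of requests.get is params (the query string)
    response = requests.get(base_url, query)

if __name__ == "__main__":
    main()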
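The idea raised in the question, removing double quotes from only some entries, could be sketched as follows. This is an illustration under assumptions, not a confirmed fix: strip_quotes_from is a hypothetical helper, and the sample values are made up to mimic a range entry and a multi-word entry.

```python
def strip_quotes_from(params, keys):
    """Return a copy of params with double quotes removed only for the given keys."""
    return {
        name: value.replace('"', "") if name in keys else value
        for name, value in params.items()
    }


# Hypothetical input: variable1 is a range, variable2 is a quoted multi-word term.
params = {"variable1": '"[value1, value2]"', "variable2": '"two words"'}

# Only the range entry loses its quotes; the multi-word entry keeps them.
cleaned = strip_quotes_from(params, keys={"variable1"})
query = "".join(cleaned.values())
```

Because the quotes are removed per entry before joining, the blanket .replace('"', '') on the joined string is no longer needed.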

Is it possible to see the read data of a pytorchtext.data.Tabulardataset?

train, test = data.TabularDataset.splits(path="./data/", train="train.csv",test="test.csv",format="csv",fields=[("Tweet",TEXT), ("Affect Dimension",LABEL)])
I have this code and want to check whether the data was loaded correctly, or whether the wrong columns are being used for the text fields.
If my file has the column "Tweet" for the texts and "Affect Dimension" for the class name, is it correct to list them like this in the fields argument?
Edit: TabularDataset stores its rows as Example objects, from which the data can be read. When reading CSV files, only "," is accepted as a delimiter; anything else results in corrupted data.
You can use any field names, irrespective of the column names in your file, because a list of fields is matched to the columns by position. I also recommend not using whitespace in field names.
So, rename Affect Dimension to Affect_Dimension or anything else convenient for you.
Then you can iterate over the individual fields as shown below to check the data that was read.
for i in train.Tweet:
    print(i)
for i in train.Affect_Dimension:
    print(i)
for i in test.Tweet:
    print(i)
for i in test.Affect_Dimension:
    print(i)
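Another way to look at what was read is to dump the first few Example objects as dictionaries. The sketch below assumes the dataset exposes its rows through an examples attribute, as the legacy torchtext Dataset class does; preview_examples is a hypothetical helper, and a SimpleNamespace stand-in replaces the real dataset so that the snippet runs without torchtext.

```python
from types import SimpleNamespace


def preview_examples(dataset, limit=3):
    """Return the first few examples as {field name: value} dictionaries."""
    return [vars(example) for example in dataset.examples[:limit]]


# Stand-in for a loaded TabularDataset (assumption: real datasets expose .examples).
fake_train = SimpleNamespace(
    examples=[
        SimpleNamespace(Tweet=["so", "happy", "today"], Affect_Dimension="joy"),
        SimpleNamespace(Tweet=["this", "is", "awful"], Affect_Dimension="anger"),
    ]
)

preview = preview_examples(fake_train, limit=1)
```

With a real dataset, preview_examples(train) would show whether the tweet text and the label ended up under the expected field names.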

Adding linebreaks within cells in CSV - Python 3

This is essentially the same question asked here: How can you parse excel CSV data that contains linebreaks in the data?
But I'm using Python 3 to write my CSV file. Does anyone know if there's a way to add line breaks to cell values from Python?
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
I've tried inserting HTML line breaks between the items, but the system I upload the data to doesn't seem to recognize HTML.
Any and all help is appreciated.
Thanks!
Figured it out after playing around and I feel so stupid.
for key in dictionary:
    outfile.writerow({
        "Order ID": key,
        "Item": "\n".join(dictionary[key])
    })
The question gives this example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
The proper way to use newlines in fields is like this:
"order_number1","item1
item2"
"order_number2","item1"
"order_number3","item1
item2
item3"
The \n sequences you show are literal characters in the string. Some software may convert them to newlines; other software may not.
Also try to avoid spaces around the separators.
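A minimal sketch of how Python's csv module produces that layout: a field containing a real newline is written inside quotes and comes back intact when read. The orders dictionary is made-up sample data, and an in-memory buffer stands in for a file (for a real file, open it with newline="" as the csv documentation recommends).

```python
import csv
import io

# Hypothetical sample data: order number -> list of items.
orders = {
    "order_number1": ["item1", "item2"],
    "order_number2": ["item1"],
    "order_number3": ["item1", "item2", "item3"],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Order ID", "Item"], quoting=csv.QUOTE_ALL)
writer.writeheader()
for key, items in orders.items():
    # "\n".join puts a real newline character between the items of one cell.
    writer.writerow({"Order ID": key, "Item": "\n".join(items)})

csv_text = buffer.getvalue()

# Reading the text back shows that the embedded newlines survive the round trip.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

Printing csv_text shows the multi-line quoted fields in the same shape as the example above.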

xlsxwriter refer to sheets inside worksheet.write_formula()

I am trying to replace the following:
worksheet = writer.sheets['Overview']
worksheet.write_formula('C4', "=MIN('Sheet_147_mB'!C2:C325)")
with something like:
for s in sheet_names:
    worksheet.write_formula(row, col, '=MIN(s +'!C2:C325')')
    row+=1
so that I can iterate through all the existing sheets in the current xlsx workbook and write the formula for each one to the overview sheet.
After spending some hours on this I could not find a solution, so it would be highly appreciated if someone could point me in the right direction. Thank you!
You don't give the error message, but it looks like the problem is with your quoting - you can't nest single quotes like this: '=MIN(s +'!C2:C325')'), and your quotes aren't in the right places. After fixing those problems, your code looks like this:
for s in sheet_names:
    worksheet.write_formula(row, col, "=MIN('" + s + "'!C2:C325)")
    row += 1
The single quotes are now nested in double quotes (they could also have been escaped, but that's ugly), and the sheet name is enclosed in single quotes, which protects special characters (e.g. spaces).
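The string building can also be pulled into a small helper, which makes it easier to handle sheet names that themselves contain an apostrophe (Excel expects it doubled inside the quoted name). This is a sketch under assumptions: quote_sheet_name and min_formula are hypothetical names, and the sheet names are sample values.

```python
def quote_sheet_name(name):
    """Wrap a sheet name in single quotes, doubling any apostrophe inside it."""
    return "'" + name.replace("'", "''") + "'"


def min_formula(sheet_name, cell_range="C2:C325"):
    """Build a MIN formula that refers to a range on another sheet."""
    return f"=MIN({quote_sheet_name(sheet_name)}!{cell_range})"


# Hypothetical sheet names, including one with a space and one with an apostrophe.
sheet_names = ["Sheet_147_mB", "My Sheet", "Bob's data"]
formulas = [min_formula(s) for s in sheet_names]

# With xlsxwriter, each formula would then be written as in the loop above:
#     worksheet.write_formula(row, col, formula)
```

The f-string avoids the nested-quote problem because the only quotes inside the Python string literal are the double quotes that delimit it.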

How to change position of data columns using regex in a CSV file using comma as separator?

I am giving up on this, and the due date is close. I enrolled in a regex class this summer (biggest mistake of my life), and our assignment is to pick an old piece of software and update it. I'm almost done with everything except this: I have a .txt file containing a database of monster attributes.
The logic is that each variable represents a column/key, and the columns are separated by commas. We need to delete, add, and reposition columns using any available tool (regex is the only thing I know that might help; do you know anything else?).
Here is the OLD form:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,ExpPer,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
First, delete the 7th column from the end (removing all ExpPer entries).
Result:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Second, duplicate the JName column into the next column.
Result:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Third, take the last 7 columns and insert them starting at the 31st column, i.e. go from ...,dMotion,Drop1id,Drop1per,... to ...,dMotion,MEXP,...,MVP3per,Drop1id,...
Result:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per
Fourth and finally, append these columns at the end: ,0,0,DONE,1
Result:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,0,0,DONE,1
So, after running however many regex search/replace steps it takes, the original row:
1052,ROCKER,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,0,0,0,0,0,0
should become:
1052,ROCKER,Rocker,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,0,0,0,0,0,0,0,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,DONE,1
Hope somebody can help me, there are 500+ monsters in this old database .txt file.
Thanks!
Microsoft Excel has a Text Import Wizard that imports data in any CSV format from a text file into an empty worksheet. For small CSV files, the wizard can be used to load the data, after which the columns can be deleted, moved, or copied and the modified data saved again in CSV format.
But the question was about reformatting the CSV file in a text editor using regular expressions.
I used UltraEdit v21.20 with the Perl regular expression engine selected, but the expressions below should work in any text editor that supports Perl regular expressions. The search and replace strings should also work with Python, provided re.MULTILINE is set so that ^ and $ match at every line; the one exception is the \40 in the final combined replace string, which Python's re module reads as a reference to group 40 and which would have to be written \g<4>0 there.
Important:
The regular expressions below work only if the CSV file does not contain commas inside double-quoted values.
First, delete 7th column from the last (deleting all ExpPer entries):
Search:   ,[^,\r\n]*?(,(?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1
Second, duplicate JName column to next column:
Search:   ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)
Replace: \1\2\2
Third, pull the last 7 columns, put them starting to 31st column:
Search:   ^((?:[^,\r\n]*?,){30})((?:[^,\r\n]*?,){15}[^,\r\n]*?),((?:[^,\r\n]*?,){6}[^,\r\n]*)$
Replace: \1\3,\2
Fourth, finally, add ,0,0,DONE,1 to the last:
Search:   (.)$
Replace: \1,0,0,DONE,1
All four replaces can also be done with a single regular expression replace:
Search:   ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,((?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1\2\2\3\5\6,\40,0,DONE,1
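For anyone running these replaces from Python rather than a text editor, here is a sketch using the re module. The patterns are the ones listed above, with \r\n excluded in every character class. Two things are assumptions on my part rather than part of the original answer: the text is expected to use "\n" line endings, and the combined replace string uses \g<4> where the editor version has \4 followed by a literal 0. The function names are hypothetical.

```python
import re

# The four search/replace pairs, applied in order to every line of the text.
STEPS = [
    # 1. Drop the 7th column from the end (ExpPer).
    (r",[^,\r\n]*?(,(?:[^,\r\n]*?,){5}[^,\r\n]*)$", r"\1"),
    # 2. Duplicate the 3rd column (JName).
    (r"^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)", r"\1\2\2"),
    # 3. Move the last 7 columns so that they start at column 31.
    (r"^((?:[^,\r\n]*?,){30})((?:[^,\r\n]*?,){15}[^,\r\n]*?),"
     r"((?:[^,\r\n]*?,){6}[^,\r\n]*)$", r"\1\3,\2"),
    # 4. Append the four new columns.
    (r"(.)$", r"\1,0,0,DONE,1"),
]

# The single combined replace; \g<4> keeps group 4 separate from the literal 0.
COMBINED_SEARCH = (
    r"^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})"
    r"((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,((?:[^,\r\n]*?,){5}[^,\r\n]*)$"
)
COMBINED_REPLACE = r"\1\2\2\3\5\6,\g<4>0,0,DONE,1"


def convert_stepwise(text):
    """Apply the four replaces one after another."""
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


def convert_combined(text):
    """Apply the single combined replace."""
    return re.sub(COMBINED_SEARCH, COMBINED_REPLACE, text, flags=re.MULTILINE)


# The sample row from the question.
original = (
    "1052,ROCKER,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,"
    "129,200,1864,864,540,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,"
    "4021,10,0,0,0,0,0,0,0,0"
)
converted = convert_stepwise(original)
```

Both functions can be run over the whole file contents at once, since re.MULTILINE makes ^ and $ match at every line.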
