I have loaded a text file using the load csv function but when I try to print the schema it shows just one field from the root including every row in that one. like this:
root
|-- Prscrbr_Geo_Lvl Prscrbr_Geo_Cd Prscrbr_Geo_Desc Brnd_Name
Any idea how to fix this?
Adding my comment as an answer since it seems to have solved the problem.
From the output, it looks like the CSV file is actually using tab characters as the separator between columns instead of commas. To get Spark to use tabs as the separator, you can use spark.read.format("csv").option("sep", "\t").load("/path/to/file")
How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = g.diff('--numstat','HEAD~1')
print(log_)
prints the entire output (lines added/deleted and file-names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to fix it, is with a regular expression:
import re
for line in log_.split('\n'):
m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
if m:
print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
The exact output, you can of course modify any way you want, once you have the data in a match m. Getting the hang of regular expressions may take a while but it can be very helpful for small scripts.
However, be adviced, reg exps tend to be write-only code and can be very hard to debug. However, for extracting small parts like this, it is very helpful.
This is essentially the same question asked here: How can you parse excel CSV data that contains linebreaks in the data?
But I'm using Python 3 to write my CSV file. Does anyone know if there's a way to add line breaks to cell values from Python?
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
I've tried appending HTML line breaks between each item but the system to where I upload the data doesn't seem to recognize HTML.
Any and all help is appreciated.
Thanks!
Figured it out after playing around and I feel so stupid.
for key in dictionary:
outfile.writerow({
"Order ID": key,
"Item": "\n".join(dictionary[key])
})
Here's an example of what the CSV should look like:
"order_number1", "item1\nitem2"
"order_number2", "item1"
"order_number3", "item1\nitem2\nitem3"
The proper way to use newlines in fields is like this:
"order_number1","item1
item2"
"order_number2","item1"
"order_number3","item1
item2
item3"
The \n you show are just part of the string. Some software may convert it to a newline, other software may not.
Also try to avoid spaces around the separators.
I am trying to find a solution how to substitute the following:
worksheet = writer.sheets['Overview']
worksheet.write_formula('C4', '=MIN('Sheet_147_mB'!C2:C325)')
with something like:
for s in sheet_names:
worksheet.write_formula(row, col, '=MIN(s +'!C2:C325')')
row+=1
to iterate through all the existing sheets in the current xlsx book and write the function to the current sheet having an overview.
After spending some hours I was not able to find any solution therefore it would be hihgly appriciated if someone could point me to any direction. Thank you!
You don't give the error message, but it looks like the problem is with your quoting - you can't nest single quotes like this: '=MIN(s +'!C2:C325')'), and your quotes aren't in the right places. After fixing those problems, your code looks like this:
for s in sheet_names:
worksheet.write_formula(row, col, "=MIN('" + s +"'!C2:C325)")
row+=1
The single quotes are now nested in double quotes (they could also have been escaped, but that's ugly), and the sheet name is enclosed in single quotes, which protects special characters (e.g. spaces).
This is my python file:-
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0,int(testInput)):
testCaseInput = testCase.readline(-6)
testCaseOutput = testCase.readline(-7)
The next step would be to strip the numbers on the basis of (','), and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this, which obviously I'm missing out on.
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
A negative argument to the readline method specifies the number of bytes to read. I don't think this is what you want to be doing.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
full_lines = f.readlines()
# parse full lines to get the text to right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]
numcases = int(lines[0])
for i in range(1, len(lines), 2):
caseinput = lines[i]
caseoutput = lines[i+1]
...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = [];
testCaseOut = [];
for line in testInput:
if (line.startsWith("Input")):
testCaseIn.append(giveMeAList(line.split("-")[1]));
elif (line.startsWith("Output")):
testCaseOut.append(giveMeAList(line.split("-")[1]));
where giveMeAList() is a function that takes a comma seperated list of numbers, and generates a list datathing from it.
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT Wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script so I would say use the power of the pattern matching iterator to help keep your code shorter and simpler:
for (input,output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s):
expectedResults = output.split(",")
testResults = runTest(input)
// compare testResults and expectedResults ...
This line has an error:
Ouput-1,1,2,3,5,8,13 // it should be 'Output' not 'Ouput
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-",""))
for i in range(0,int(testInput)):
testCaseInput = testCase.readline().replace("Input-","")
testCaseOutput = testCase.readline().replace("Output-","").split(",")