I'm trying to read in a text file on Amazon EMR using the Python Spark libraries (PySpark). The file is in the home directory (/home/hadoop/wet0), but Spark can't seem to find it.
Line in question:
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
Error:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-19-121.us-west-2.compute.internal:8020/user/hadoop/wet0;'
Does the file have to be in a specific directory? I can't find information about this anywhere on the AWS website.
If it's in the local filesystem, the URL should be file://user/hadoop/wet0
If it's in HDFS, that should be a valid path. Use the hadoop fs command to take a look, e.g.:
hadoop fs -ls /home/hadoop
One thing to look at: you say the file is in "/home/hadoop", but the path in the error is "/user/hadoop". Make sure you aren't using ~ on the command line, as bash will do the expansion before Spark sees it. Best to use the full path /home/hadoop.
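To make the intent explicit, you can spell out the scheme yourself. A minimal sketch, assuming spark is an existing SparkSession and the file really is at /home/hadoop/wet0 on the master node:

# local filesystem of the node running the driver
local_lines = spark.read.text("file:///home/hadoop/wet0").rdd.map(lambda r: r[0])

# HDFS (the default scheme on EMR when none is given)
hdfs_lines = spark.read.text("hdfs:///user/hadoop/wet0").rdd.map(lambda r: r[0])

Note that with file:// every node that ends up reading the file needs to see that path, so on a multi-node cluster it is often simpler to copy the file into HDFS first with hadoop fs -put /home/hadoop/wet0 /user/hadoop/.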
I don't know if it's just me, but when I tried the suggestion above I still got a "path does not exist" error on my EMR cluster. I just added one more "/" before "user" and it worked:
file:///user/hadoop/wet0
Thanks for the help!
Related
I've hit another bug. I'm now trying to set up continuous deployment for Google Cloud Run from my GitHub repository, and it's not finding my Dockerfile. I've tried various combinations of file paths, but it still gives me this error:
Already have image (with digest): gcr.io/cloud-builders/docker
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /workspace/Dockerfile: no such file or directory
This is my file structure in VSCode, and if I run the command from the first Exeplore folder, it finds the Dockerfile.
The source directory is public, so I just can't figure out why it's not finding the Dockerfile.
https://github.com/Pierre-siddall/exeplore
Any advice at all would be greatly appreciated, thank you!
EDIT 1:
For some reason, the file path's capitalisation on GitHub was different from what I had in VSCode, so I suspect something on GitHub's end. Now I'm having issues getting the continuous deployment to actually update the page: it redeploys, but the page doesn't update.
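If the mismatch came from git itself (it ignores case-only renames on the case-insensitive filesystems that macOS and Windows use by default), a two-step rename forces the new casing into the repository. This is only a sketch; the folder names are illustrative and should be swapped for whichever casing Cloud Build expects:

git mv Exeplore exeplore-tmp
git mv exeplore-tmp exeplore
git commit -m "Fix folder casing"
git push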
Hi,
I have a library for bioinformatics called BioPython, which consists of several files.
I am unable to retrieve the file, and it gives me the above errors.
My Python 3.8.2 IDE shell is in Documents and my bioinformatics library file is in Documents as well. I think there is something wrong with the path to my package for Python 3.8, but I am not sure. Can someone point me in the right direction?
Keeping the Python shell and the file you want to access in the same directory does not matter. What matters is that your working directory contains the file, so that you can import it without specifying the complete path.
Well, the error shows that the file has either been renamed or moved to some other directory. It is better to specify the full path.
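If you want to import it without moving anything around, you can first check which directory Python is actually running in and then add the library's location to the search path. A minimal sketch, assuming the unpacked library lives at ~/Documents/biopython and that that folder contains the Bio package (adjust the path to wherever your copy actually is):

import os
import sys

print(os.getcwd())  # the working directory the shell is really using
sys.path.append(os.path.expanduser("~/Documents/biopython"))  # hypothetical location of the library

import Bio  # should resolve once its parent folder is on sys.path
print(Bio.__version__)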
I've already parsed my data into a Geoff .txt file.
I've downloaded the three files and I've put the two .jar files under
/Applications/Neo4j Community Edition.app/Contents/Resources/app/plugins
Since my server folder is located in a different directory, I've added the line:
org.neo4j.server.thirdparty_jaxrs_classes=com.nigelsmall.load2neo=/load2neo
to my configuration file that's at:
/Users/Lucas/Documents/Neo4j/testdb/neo4j.properties
When I try to start the server I get:
Starting Neo4j failed: Multiple exceptions
I'm pretty sure that I'm placing files in the wrong locations and/or using the wrong paths, but I don't have a clue how to fix it.
Any help? Thanks!
There were two issues:
1. The path given in the configuration file was incorrect. In my case, the correct line was: org.neo4j.server.database.location=/Users/Lucas/Documents/Neo4j/testdb
2. I had another process using port 7474. After I changed the path to the correct value, the Neo4j application gave me an error message about the port being in use, and I killed the offending process with lsof -i :<port> and kill -9 <PID>.
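For reference, step 2 boiled down to these two commands (in my case the port was 7474; <PID> is a placeholder for whatever lsof prints):

lsof -i :7474   # shows which process is holding the port Neo4j wants
kill -9 <PID>   # <PID> is the process id from the lsof output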
Hope it helps someone (:
I'm trying to use data from CSV files in PySpark. I found a module called PySpark-CSV which does exactly what I need. According to the PySpark-CSV GitHub page, "no installation [is] required", so I figured I could just unzip the source into a directory called 'pyspark_csv' on my Python path and run the commands listed on their website:
import pyspark_csv as pycsv
sc.addPyFile('pyspark_csv.py')
But this gives me an ImportError saying that it can't find pyspark_csv.
The README doesn't help me any further and other info is scarce. Anyone here familiar with the module?
It means Python cannot find pyspark_csv.py. This is because you put the file in pyspark_csv and Python is not aware of that directory. Let's say the full path of the directory is /foo/pyspark_csv. You can modify PYTHONPATH, or use another method, to tell Python where you have put your files.
# Run this in a bash shell before you execute python,
# or put this line at the bottom of your .bashrc file.
export PYTHONPATH=$PYTHONPATH:/foo/pyspark_csv
Use a full path for Spark, too:
sc.addPyFile('/foo/pyspark_csv/pyspark_csv.py')
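Putting it together inside the script instead of the shell, here is a minimal sketch, assuming the module really was unzipped to /foo/pyspark_csv and that sc is an existing SparkContext:

import sys
sys.path.insert(0, "/foo/pyspark_csv")  # alternative to exporting PYTHONPATH in the shell

import pyspark_csv as pycsv
sc.addPyFile("/foo/pyspark_csv/pyspark_csv.py")  # ships the module to the executors as well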
I am trying to run a Python script every few minutes using cron jobs. It reads a CSV file and adds data to a MySQL table. I am using HostGator's cPanel for cron jobs.
The full error is:
/bin/sh: /home3/harryv/public_html/fixturetest.py: urs/bin/env: bad interpreter: No such file or directory
My Cron Jobs command is:
/home3/harryv/public_html/fixturetest.py
If it helps at all, I have realised this isn't a typical "can't find file" error; that error would just say something like "No such file or directory", yet this one includes a lot more. I'm sure that my file IS in the correct directory.
Sorry if I had made a simple mistake, I am not very experienced.
EDIT: Embarrassingly, I realised that I needed to add the word 'python' just before the script path in my cron job command.
The path to the env executable should (usually) be /usr/bin/env, so your shebang line should read:
#!/usr/bin/env python
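Either fix works on its own. With a corrected shebang the script just needs to be executable, after which the bare path from the question is a valid cron command (a sketch using the path from the question; the 'python /path/to/script.py' form from the edit above works too):

chmod +x /home3/harryv/public_html/fixturetest.py
/home3/harryv/public_html/fixturetest.py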