Python for Data Analysis, Chapter 2, first example

Python for Data Analysis, Chapter 2, first example - python

I'm following along with the examples in a translated version of Wes McKinney's "Python for Data Analysis" and I was blocked in first example of Chapter 2
I think my problem arose because I saved a data file in a wrong path. is that right?
I stored a file, usagov_bitly_data2012-03-16-1331923249.txt, in C:\Users\HRR
and also stored folder, pydata-book-mater, that can be downloaded from http://github.com/pydata-book in C:\Users\HRR\Anaconda2\Library\bin.

Depends.
You might change the location you save your File or eddit the path you give to your code in Line 10. Since you're yousing relativ Paths i guess your script runs in C:\Users\HRR\Anaconda2\Library\bin, which means you have to go back to C:\Users\HRR or use an absolute Path ... or move the File, but hell you don't want to move a file every time you want to open it, like moving word files into msoffice file to open it, so try to change the Path.
And allways try harder ;)

In python open() will open from the current directory down unless given a full path (in linux that starts with / and windows <C>://). In your case the command is open the folder ch02 in the directory the script is running from and then open usagov_bitly_data2012-03-16-1331923249.txt in that folder.
Since you are storing the text file in C:\Users\HRR\usagov_bitly_data2012-03-16-1331923249.txt and you did not specify the directory of the script. I recommend the following command instead open(C:\\Users\\HRR\\usagov_bitly_data2012-03-16-1331923249.txt)
Note: the double \ is to escape the characters and avoid tabs and newlines showing up in the path.

Related

Unknown problem in Python. Printing does not work and files are not saved correctly

When I try to write something, such as variables, the code is renamed to the file name on the computer.
For example, if I write:
a = 20
f = 15
print(a+f)
then the code file will automatically be renamed to the first line, i.e. "a = 20"
Then, when I try to run the code, the program outputs nothing but "Python" and some incomprehensible words.
What could it be related to?
enter image description here
enter image description here
I installed the latest version of Visual Stuio Code with Python, they are new, so there should be no problems. But this time it went wrong.
After reinstalling the program, the problem remains.

First of all, if there is no special requirement, please do not use Code Runner to run the script, using the official extension Python is a better choice.
In addition, the dot on your file label means that you have not saved the file, you can add the following setting to enable automatic saving in the settings.
"files.autoSave": "afterDelay",
You may have created the file using the following method. File --> New File... --> Python File. At this time, the file has not been named, also not saved. You can see that there is no such file in the resource manager list at this time.
So the file label shows the first line of codes. This is a feature of vscode, you can refer to this link. And because the file has not been saved, there will be problems executing the script.
You can rename the script file directly (F2), or vscode will remind you to name the file when saving. Another way to create a file is to right click and choose New File..., enter filename and end with .py extension.

Convert multiple multipage PDFs to JPGs in subfolders

Simple use case:
A folder with many (mostly multipage) PDF files.
A script should convert each PDF page to JPG and store it in a subfolder named after the PDF filename. (e.g. #33.pdf to folder #33)
Single JPG files should also have this filename plus a counter mirroring the sequential page number in the PDF. (e.g. #33_001.jpg)
I found a bounch of related questions, but nothing that quite does what I want, e.g.
How do I convert multiple PDFs into images from the same folder in Python?
A python script would work fine, but also any other way to do this in Win10 (imagemagick, e.g.) is cool with me.

Your comment requests how a batch can do as required, for simplicity the following only processes a single file so Python will need to loop through a folder and call with each name in turn. That could be done by adding a "for loop" in batch but first see where problems arise, as many of my single test files threw differing errors.
I have tried to cover several fails in this batch file, in my system, but there can still be issues such as a file that has no valid fonts to display
For most recent poppler windows 64bit utils see https://github.com/oschwartz10612/poppler-windows/releases/ for 32 bit use xpdf latest version http://www.xpdfreader.com/download.html but that has direct pdftopng.exe so needs a few edits.
pdf2dir.bat
#echo off
set "bin=C:\Apps\PDF\poppler\22.11.0\Library\bin"
set "res=200"
REM for type use one of 3 i.e. png jpeg jpegcmyk (PNG is best for documents)
set "type=png"
if exist "%~dpn1\*.%type%" echo: &echo Files already exist in "%~dpn1" skipping overwrite&goto pause
if not exist "%~dpn1.pdf" echo: &echo "%~dpn0" File "%~dpn1.pdf" not found&goto pause
if not exist "%~dpn1\*.*" md "%~dpn1"
REM following line deliberately opens folder to show progress delete it or prefix with REM for blind running
explorer "%~dpn1"
"%bin%\pdftoppm.exe" -%type% -r %res% "%~dpn1.pdf" "%~dpn1\%~n1"
if %errorlevel%==1 echo: &echo Build of %type% files failed&goto pause
if not exist "%~dpn1\*.%type%" echo: &echo Build of %type% files failed&goto pause
:pause
echo:
pause
:end
It requires Poppler binaries path to pdftoppm be correctly set in the second line
It can be placed wherever desired i.e. work folder or desktop
It allows for drag and drop of one pdf on top will (should work) without need to run in console
Can be run in a command console and place a space character after, you can drag and drop a single filename but any spaces in name must be "double quoted"
can be run from any shell or OS command as "path to/batchfile.bat" "c:\path to\file.pdf"

Strange behavior when trying to create and write to a text file on macOS [duplicate]

This question already has answers here:
Convert UTF-8 with BOM to UTF-8 with no BOM in Python
(7 answers)
Closed last year.
I'm opening a plain text file, parsing it, and adding different lines to existing, empty string variables. I add these variables into a new variable that is a multi-line fstring. Trying to write the data to a new text file is not behaving as expected.
Reading the original file works fine. Text is properly parsed, variables populated.
The multi-line fstring variable seems fine. Prints normally. Even tried formatting it different ways which I show below.
When writing to a new file, that's where the strangeness starts. I've tried 2 ways:
Straight coding the open function with w or w+
Adding the above to a function and using that inside main()
The file is saved to disk with the correct name. Trying to double-click open in Finder produces nothing. Right-click to open produces nothing. Trying to move to trash with command+delete gives an error:
It sounds like the file goes to trash, but as the file disappears from the folder a new one is created with the same name in its place.
If I try to open in TextMate via File > Open, it opens as a blank file with no errors.
Since I can't get rid of the file, I have to delete the directory and create the directory again with the same name, or force delete in Terminal using rm. Restarting the system does not help. Relaunching Finder does nothing. Saving text files from other apps works fine. Directory is chmod 755.
If I copy an existing text file into the output directory, rename it to what the file is expected to be named, and let python overwrite the contents, it doesn't work either. The file modification date changes (and I see the file "blink" in Finder) but the contents remain the same. However, the file is not corrupted and opens normally.
If I do the same but delete the text inside of the copied file first, then run the script, python writes no data to the file, I can't open it by double-clicking on it, and I get error -43 again with the odd non-trashing behavior.
The strangest thing is this: if I add another with open() at the end of the script, and open the file that was just created and supposedly written to, and print its contents, the contents print. It's like when the script ends the file contents are being removed or its being corrupted somehow. Tried to close the file inside the script even though it's not needed, but same behavior persists.
Code:
Here's the code for writing:
FORMAT='utf-8'
OUTPUT_DIR = '/Path/To/SaveFolder'
# as a function
def write_to_file(content, fpath, name):
the_file = os.path.join(fpath, name)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(content)
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
write_to_file(multiline_var, OUTPUT_DIR, filename)
# or hard coded in main()
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
the_file = os.path.join(OUTPUT_DIR, filename)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(multiline_var)
I have tried using w w+ wt and wt+ and with and without encoding='utf-8'
Here is an example of multi-line fstring variable:
# using triple quotes
multiline_var = f"""
[PROJ-{pcode}] {full_title} by {author}
{description}
{URL}
{DIVIDER_1}
{TEXT_BLURB}
Some text here and then {SOME_MORE_TEXT}"
{DIVIDER_1}
{SOME_LINK}
"""
# or inside parens
multiline_var = (
f"[PROJ-{pcode}] {full_title} by {author}\n"
f"{description}\n\n"
f"{URL}\n"
f"{DIVIDER_1}\n"
f"{TEXT_BLURB}\n\n"
f"Some text here and then {SOME_MORE_TEXT}\n"
f"{DIVIDER_1}\n\n"
f"{SOME_LINK}"
)
Using exiftool on the text file shows the following, so it looks the data is there but must be corrupted:
File Size : 1797 bytes
File Modification Date/Time : 2021:12:31 15:55:39-05:00
File Access Date/Time : 2021:12:31 15:58:13-05:00
File Inode Change Date/Time : 2021:12:31 15:55:39-05:00
File Permissions : -rw-r--r--
File Type : TXT
File Type Extension : txt
MIME Type : text/plain
MIME Encoding : utf-8
Byte Order Mark : No
Newlines : Unix LF
Line Count : 55
Word Count : 181
Not sure what I'm doing wrong. VScode shows no syntax errors in the script. There are no errors in Terminal when running the script. Have I made some simple mistake in the above code? Maybe the fstring variable is causing a problem?

Thanks to #bnaecker for leading me to the solution to this problem.
It appeared that when creating/writing to a text file with a long name, Python can corrupt it. Not sure why, as I save long names for images with Python image libraries all the time. Using a short name like "MyFile.txt" it worked just fine, but that was a red herring.
I have updated this post with my journey to the final solution for using the long names that are needed for my project, though I'm not sure why the problem exists.
First Attempts:
So far creating using a short name and then renaming to a long one.... attempts have failed. I did notice that python is locking the file it creates and never unlocks it. Not sure if this is the problem. Setting chflags with os.system('chflags nouchg') command does not work, not even with sudo, and not even in the Terminal doing it manually.
Using os.rename() in Python corrupts the file
Using os.system('mv oldFile.txt newFile.txt') corrupts the file
Manually using mv command in Terminal corrupts the file
Manually changing the filename in the Finder does not (wtf?)
I kept looking for workarounds but nothing did the job.
Round 2:
Progress!
After much tinkering, I discovered a hidden character inside the file. I ran cat /path/longfilename.txt in Terminal, selected and copied the output and pasted into VScode. Here is what I saw:
Somehow a hidden character is getting into the project code number.
Pasting it into a Unicode search engine it came up as a ZERO WIDTH NO-BREAK SPACE also known in Unicode as EF BB BF. However, when pasting this symbol into TextMate it shows up as <U+FEFF> which is?...
The Byte Order Mark!
Opening a normal utf-8 text file in a hex editor also shows the files starting with EFBBBF for the BOM.
Now, the text file being read and parsed at first has no blank lines to start the file, so I added a line break, and also tried adding some spaces. This time when writing the file I could open it, however, after sending it to the trash, the same behavior occurred and the file was broken again. It seems that because other corrupted versions were in the trash, it added the symbol back to the file name for some reason.
So what appears to be happening, for whatever reason, when Python opens the text file I'm parsing that has no line break at the top, it seems to be grabbing the BOM from the file and adding that to the first variable which is grabbing the first line of the text file. Since that text is a number code that starts the file name, the BOM symbol is being added to the file name as well as the code inside the text file.
Just... wow
The Current Solution:
I have to leave a blank line at the start of the text file that I'm opening and parsing and a simple line break won't do it. I have no idea why this is. I added some spaces for good measure because randomly the BOM would be added to the variable and filename again. So far (knock on wood) as long as the first line of that initial file has some spaces and then a line break, and previous corrupted files have been deleted from the trash, a long file name can be used for all the files I'm creating and writing to without any problems.
This corruption even persists if I remove the encoding flag from both of the open functions I'm using (one to read and parse, the other to create and write).
If anyone knows why this is happening, please share. I've never seen it mentioned before. I'm not sure if it's a python 3.8 bug, a mac OS bug, the way TextMate wrote the original file, or a combination of these.
Correct Solution:
Thanks to #tripleee for the proper way to handle this, as I don't remember seeing this before, though I haven't been using python for very long.
In order to ignore the BOM, reading in the text file to be parsed with an encoding='utf-8-sig' does the job. Seems to be why it exists. :)
Problem solved.

ValueError: need more than 0 values to unpack (Python 2)

I am trying to replicate another researcher's findings by using the Python file that he added as a supplement to his paper. It is the first time I am diving into Python, so the error might be extremely simple to fix, yet after two days I haven't still. For context, in the Readme file there's the following instruction:
"To run the script, make sure Python2 is installed. Put all files into one folder designated as “cf_dir”.
In the script I get an error at the following lines:
if __name__ == '__main__':
cf_dir, cf_file, cf_phys_file = sys.argv[1:4]
os.chdir(cf_dir)
cf = pd.read_csv(cf_file)
cf_phys = pd.read_csv(cf_phys_file)
ValueError: need more than 0 values to unpack
The "cf_file" and "cf_phys_file" are two major components of all files that are in the one folder named "cf_dir". The "cf_phys_file" relates only to two survey question's (Q22 and Q23), and the "cf_file" includes all other questions 1-21. Now it seems that the code is meant to retrieve those two files from the directory? Only for the "cf_phys_file" the columns 1:4 are needed. The current working directory is already set at the right location.
The path where I located "cf_dir" is as follows:
C:\Users\Marc-Marijn Ossel\Documents\RSM\Thesis\Data\Suitable for ML\Data en Artikelen\Per task Suitability for Machine Learning score readme\cf_dir
Alternative option in readme file,
In the readme file there's this option, but also here I cannot understand how to direct the path to the right location:
"Run the following command in an open terminal (substituting for file names
below): python cfProcessor_AEAPnP.py cf_dir cf_file cf_phys_file task_file jobTaskRatingFile
jobDataFile OESfile
This should generate the data and plots as necessary."
When I run that in "Command Prompt", I get the following error, and I am not sure how to set the working directory correctly.
- python: can't open file 'cfProcessor_AEAPnP.py': [Errno 2] No such file or directory
Thanks for the reading, and I hope there's someone who could help me!
Best regards & stay safe out there during Corona!!
Marc

cf_dir, cf_file, cf_phys_file = sys.argv[1:4]
means, the python file expects few arguments when called.
In order to run
python cfProcessor_AEAPnP.py cf_dir cf_file cf_phys_file task_file jobTaskRatingFile jobDataFile OESfile
the command prompt should be in that folder.
So, open command prompt and type
cd path_to_the_folder_where_ur_python_file_is_located
Now, you would have reached the path of the python file.
Also, make sure you give full path in double quotes for the arguments.

Python won't create / write to a file

I am new at python and learning the language. The following code should create a file in the running program directory and write to it but it doesn't do this at all in a .py file. If I put the same code in the IDLE shell it returns 17. No errors just doesn't create the file. What am I doing wrong?
with open("st.txt", "w") as f:
f.write("Hi from Python!")
Thanks for the help
Mike

This code is flawless, no problem!
I guess that in your REPL shell, the $PWD environment variable is set for somewhere, so your destination file is in some corner.
No exception thrown indicates that no problem with access authority.
Maybe you can set some absolute path string, such as ~/st.txt
By the way, the successful invoke should return 15 instead of 17, totally count 15 chars.

your code works well, st.txt will be touched at executing path.
other ways, your system account can't write in your execute path.
try in your $HOME path to execute your code, I think, It will work well

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.