Deleting stubborn \r in data frame and creating CSV - python

I am new to the field, and I am having problems getting rid of a mid-string \r in a pandas DataFrame that I need to export to a CSV file.
Context: I had a CSV file that I downloaded as a report from the database platform we use in my organization. The report is legible to humans, not to computers, so there are all sorts of merged cells, page breaks, and lots of other formatting. I need to clean it to create a SQL database. One of the columns has an ID number that appears split across two lines when I view it in Excel:
This is how the original CSV looks when viewed in Excel.
I have tried to delete that split, but I can't manage it. When the file is imported as a DataFrame, Python shows there is an "\r" mid-string, like below:
150043\r35
So this is what I have done:
I imported the CSV file:
import pandas as pd
df = pd.read_csv("Assessment.csv", header=None)
I attempted this:
df.replace("\r\n","", regex=True)
And this:
df.replace("\r","", regex=True)
After both attempts, it seemed that \r had disappeared in the data frame, like below:
15004335
However, when I create a new CSV, it keeps separating the lines:
This is how it looks even after using the replace function:
In the text editor, it looks like this:
,0,1,6,8,13,15,20,27
0,,,,,Student ID: ,150043
35,,
1,Student:,...
How do I get rid of this permanently? Am I missing something?
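One thing worth checking here: df.replace returns a new DataFrame rather than modifying df in place, so the cleaned result has to be assigned back before exporting. A minimal sketch of the full round trip (the output file name is a placeholder):
import pandas as pd

df = pd.read_csv("Assessment.csv", header=None)
df = df.replace("\r", "", regex=True)  # assign the result back, or the cleanup is lost
df.to_csv("Assessment_clean.csv", index=False, header=False)  # hypothetical output name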

Related

Python reads only one column from my CSV file

First post here.
I am very new to programming, sorry if this is confusing.
I made a database by collecting multiple different data online. All these data are in one xlsx file (one column per variable), which I converted to CSV afterwards because my teacher only showed us how to use CSV files in Python.
I installed pandas and made it read my CSV file, but it seems it doesn't understand that I have multiple columns; it reads everything as a single column. As a result, I can't get the info on each variable (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True), but they show the same thing:
len(df.columns) = 1, which proves it doesn't see that each variable has its own column
len(df) = 1923, which is right
This is what I was expecting: https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And this is what I get instead: https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know, it looks pretty similar, so why doesn't it work? :((
Thanks.
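A common cause of a single-column read is that the file uses a delimiter other than a comma (Excel often exports with ';' depending on locale). A minimal sketch, letting pandas sniff the separator; the file name here is a placeholder:
import pandas as pd

# sep=None with the python engine asks pandas to sniff the delimiter.
df = pd.read_csv('mydata.csv', sep=None, engine='python')
print(len(df.columns))  # should now match the number of columns in the xlsx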

How to delete multiple rows of a .csv file in jupyter notebook using Python?

Hi, so I am very new to coding!
I have a huge .csv file (over 1 million rows) and need to delete all data from before 1st January 2010 at 00:00.
I have tried googling how to do this, but everything I can find works with row numbers rather than deleting by date/time.
I tried:
df [(df['Date Time'].dt.year < 2010-0o1)]
But it came up with a very long error (I have screenshotted most of it in the image below).
Edit: I have also included a snippet of what the file looks like, with the headings.
It looks like your file is semicolon-separated rather than comma-separated, and so all the columns have been read as a single heading.
Try df = pd.read_csv(file_path, sep=';')
Similar discussion here:
How to read a file with a semi colon separator in pandas
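Once the columns parse correctly, the date filter itself also needs attention: 'Date Time' must be parsed as datetimes, and 2010-0o1 is integer arithmetic (2010 minus octal 1), not a date. A sketch assuming the column name from the question and a placeholder file path:
import pandas as pd

file_path = 'data.csv'  # placeholder
df = pd.read_csv(file_path, sep=';', parse_dates=['Date Time'])
df = df[df['Date Time'] >= '2010-01-01']  # keep 1 Jan 2010 00:00 and later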

Convert CSV file to CSV with the same amount of columns, via the command line

I downloaded several CSV files from a finance site. These files are inputs to a Python script that I wrote. The rows in the CSV files don't all have the same number of values (i.e. columns). In fact, blank lines have no values at all.
This is what the first few line of the downloaded file look like :
Performance Report
Date Produced,14-Feb-2020
When I attempt to add the rows to a pandas DataFrame, the script incurs a "mismatched columns" error.
I got around this by opening up the files in macOS Numbers and manually exporting each file to CSV. However, I don't want to do this each time I download a CSV file from the finance site. I have googled for ways to automate this but have not been successful.
This is what the first few lines of the "Numbers" exported csv file looks like:
,,,,,,,
Performance Report,,,,,,
Date Produced,14-Feb-2020,,,,,
,,,,,,,
I have tried playing with the dialect value of the csv reader module but have not been successful.
I have also tried appending the columns manually in the Python script, but that did not work either.
Essentially, midway down the CSV file is the table that I place into the DataFrame. Below is an example.
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06,
That table prior to exporting via "Numbers" looks like so:
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06
Above, the Sub Total row does not have a value in the last column, and does not represent it as ,"", which would make all rows have an equal number of columns.
Does anyone have any ideas on how I can automate the Numbers export process? Any help would be appreciated. I presume these are varying formats of CSV.
In pandas read_csv you can skip rows. If the number of header rows is consistent, then:
pd.read_csv('myfile.csv', skiprows=2)
If the first few lines are not consistent or the problem is actually deeper within the file, then you might experiment with try: and except:. Without more information on what the data file looks like, I can't come up with a more specific example using try: and except:.
There are many ways to do this in your script rather than adding commas by means of separate programs.
One way is to preprocess the file in memory in your script before using pandas.
But when you are using pandas, you should use its built-in power.
You have not shared what the actual data rows look like, and without that no one can really help you.
I would look into using the following two kwargs of read_csv to get the job done (see the sketch after this list):
skiprows as a callable,
i.e. make your own function and use it as a filter to filter unwanted rows away
error_bad_lines set to False to just ignore errors and deal with them once the data is in the DataFrame
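As a concrete sketch of the skiprows-as-callable idea, assuming the table always starts at the 'Asset Name' header shown above (the file name is a placeholder):
import pandas as pd

# Find the line where the real table starts, then skip everything above it.
with open('performance.csv') as f:
    header_row = next(i for i, line in enumerate(f)
                      if line.startswith('Asset Name'))

df = pd.read_csv('performance.csv', skiprows=lambda i: i < header_row)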

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a DataFrame so I can do the data analysis I need to do. Here is the messy-looking file:
Messy text
I can read it in as a CSV file that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints out the data aligned, but the issue is that the output is [640 rows x 1 column], and I need to separate it into multiple columns to manipulate it as a DataFrame.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
One suggestion is to pass delim_whitespace=True (see the read_csv docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
Since you provided only a .png picture, it is not even clear whether this file is divided into rows or not.
If it is not, you have to start by "cutting" the content into individual lines and then read the output file that results from this cutting.
I think this is the first step before you can use either read_csv or read_table (of course, with delim_whitespace=True).
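Before reaching for read_csv, a quick way to check whether the file is divided into rows at all (file name from the question):
with open('phx_30kV_indepth_0_0_outfile.txt') as f:
    sample = f.read(2048)  # a small sample is enough
print(sample.count('\n'))  # 0 means the content still has to be cut into lines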

How to correct a DataFrame header containing whitespace

I am new to Python and actually started with R.
My problem is that I am unable to debug KeyErrors from my pandas DataFrames. Here is part of the code:
I read in a DataFrame from Excel with the following commands:
import os
import pandas as pd

cwd = os.getcwd()
os.chdir(directorytofile)  # directorytofile: path to the directory with the file
os.listdir('.')
file = dataset  # dataset: the Excel file name
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1')
Now when I want to select a header with a blank space in it, like
Lieferung angelegt am
(German for "delivery created on", sorry for that)
I get the KeyError. I tried different ways to delete blank spaces in my headers when building the DataFrame, like:
sep='\s*,\s*'
But the error still occurs. Is there a way for me to see where the problem happens?
Obviously it's about the blank spaces, because headers without them work fine.
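A quick way to see exactly what went wrong is to print the parsed header names with their whitespace visible, and then normalize them; a minimal sketch building on the code above:
# Show the exact column names, including any stray spaces:
print([repr(c) for c in df1.columns])

# Strip leading/trailing whitespace from every header, then select:
df1.columns = df1.columns.str.strip()
print(df1['Lieferung angelegt am'].head())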
