Some Empty rows are not visible in pandas dataframe


Background:
I am reading a large CSV file with more than 40K rows. The data needed many modifications, which I made without any issues, since I have been using pandas for the last few months.
Issue: The CSV file contains many empty rows in between the data. These rows contain only a single hidden character, the EOL (end-of-line) character, and pandas ignores them.
I tried to share the sample data here but the hidden character is getting removed so I am sharing a snapshot where it shows the hidden character.
I used this website to get the above info: dostring.com/show-hidden-characters
I went through the popular questions on this topic in this forum, but nothing helped. Please suggest other solutions.
Here is how I found out that some empty rows are not visible in the DataFrame:
Checking the count of empty values:
df.example_column.isnull().sum()
This returned a non-zero count.
Converting the column datatype:
df.example_column = df.example_column.astype('str')
This conversion turns the empty cells into the string 'nan'.
I checked the isnull().sum() count again, and it is now zero, because the string 'nan' is not a null value.
Finally, I exported the data to CSV but still saw some empty rows, which is strange.
Then I used the following command to look at the rows at runtime:
df[165:175]
This surprised me again. Rows 168 and 169 are both empty when the file is opened in MS Office. In the console, I see only one empty row, 169, which pandas marks as 'nan', while row 168 is replaced by the data of row 167.
This happens throughout the whole CSV sheet: pandas ignores one of the empty rows at runtime, but both rows are visible in MS Office.
FYI, here are the settings I use when reading the CSV file:
sep=",", skipinitialspace=False, skip_blank_lines=False, encoding='utf-8'
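One way to check which line terminators the file really contains is to count them in the raw bytes before pandas parses anything. This is only a diagnostic sketch: it assumes the hidden characters are stray carriage returns or line feeds, which the question does not confirm, and the function name and sample bytes are made up for illustration.

```python
import re


def count_line_endings(raw: bytes) -> dict:
    """Count each kind of line terminator found in raw file bytes."""
    crlf = len(re.findall(rb"\r\n", raw))          # Windows-style endings
    lone_cr = len(re.findall(rb"\r(?!\n)", raw))   # CR not followed by LF
    lone_lf = len(re.findall(rb"(?<!\r)\n", raw))  # LF not preceded by CR
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# Hypothetical sample: mostly CRLF, plus one stray CR and one stray LF.
sample = b"a,b\r\n1,2\r\n\r3,4\r\n\n5,6\r\n"
print(count_line_endings(sample))
```

For a real file, the bytes could be read with open(path, "rb").read(). If more than one kind of terminator shows up, a spreadsheet program and pandas may disagree on where rows begin and end.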

Related

Extraction of complete rows from CSV using a list when we don't know the row indices

Can somebody help me solve the problem below?
I have a relatively large CSV with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs.
Note: I don't know the indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558]
Now we need to extract the complete rows for faulty_Id = [50055, 72525, 82998, 1555558]
Best Regards

If the case ID is present as a column header in the CSV file, you can read the file into a pandas DataFrame with read_csv, set the case ID column as the index, and extract rows based on the faulty IDs. For more help, please attach sample data from the CSV.
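A minimal sketch of the pandas approach, assuming the ID column is named "Case ID" as in the question and that the IDs are stored as integers. Reading in chunks is my own addition to keep memory use bounded for a file this size; the function name, chunk size, and sample data are illustrative only.

```python
import io

import pandas as pd


def extract_rows(csv_source, ids, id_column="Case ID", chunksize=100_000):
    """Return the rows whose id_column value is in ids, reading chunk by chunk."""
    wanted = list(ids)
    matches = [
        chunk[chunk[id_column].isin(wanted)]
        for chunk in pd.read_csv(csv_source, chunksize=chunksize)
    ]
    return pd.concat(matches, ignore_index=True)


# Tiny in-memory stand-in for production_data.csv.
sample = io.StringIO("Case ID,value\n50055,a\n11111,b\n72525,c\n")
faulty = extract_rows(sample, [50055, 72525, 82998, 1555558], chunksize=2)
print(faulty)
```

With the real file, csv_source would be the path "production_data.csv", and the result could be saved with faulty.to_csv("faultdata.csv", index=False).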

CSV, which is relatively large with over 1 million rows X 4000 columns
As CSV files are just text files, and this one is probably too big to load as a whole, I suggest using the built-in fileinput module. If the ID is the 1st column, create extractfaults.py as follows:
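import fileinput
faulty = ["50055", "72525", "82998", "1555558"]
for line in fileinput.input():
    # lineno() is 1 (not 0) for the first line, so use isfirstline() to keep the header
    if fileinput.isfirstline():
        print(line, end='')
    # keep rows whose first field is one of the faulty IDs
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')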
and use it the following way:
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the 1st line (header) or have one of the provided IDs (I used the optional 2nd .split argument to limit the number of splits to 1). Note the usage of end='', as fileinput keeps the original newlines. My solution assumes that the IDs are not quoted and that the ID is the first column. If either of these does not hold true, feel free to adjust my code to your purposes.
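If the IDs might be quoted, or the ID is not in the first column, one possible adjustment is to let the standard csv module do the field splitting. This is a sketch of a variant, not the code above: filter_rows and id_index are names invented here, and the sample data is made up.

```python
import csv
import io


def filter_rows(lines, faulty_ids, id_index=0):
    """Yield the header row, then every row whose ID field is in faulty_ids."""
    wanted = {str(i) for i in faulty_ids}
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is not None:
        yield header
    for row in reader:
        # csv.reader has already removed any quotes around the ID field.
        if len(row) > id_index and row[id_index] in wanted:
            yield row


sample = io.StringIO('Case ID,value\n"50055",a\n11111,b\n72525,"c,d"\n')
rows = list(filter_rows(sample, [50055, 72525]))
print(rows)
```

The rows can be written back out with csv.writer, which re-quotes any field that contains a comma.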

The best way is to use a database such as Postgres or MySQL. Copy your data into the database first, and then you can easily operate on rows and columns. A file is not the best solution in your case, since you need to load all the data from the file into memory to process it, and opening the file also takes a lot of time.

Manipulating API data in Python before appending to a csv file

I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows of data in all (the data as it appears following a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so that I only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python 'print' command (the first asterisk is row 1. The fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>

I interpret your problem as not being able to append the newly fetched data to the CSV file. I am not sure whether you are working with some module that helps with working with CSV files, but I just assume you don't for now.
If you open the file with "a" as the second argument to open, e.g. f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row, though, and append a new one later, but I think that isn't the problem.
Hope I could help you, and please tell me whether I understood correctly what your problem is here :)
Btw, I would recommend looking into a CSV module for this or something like sqlite3.

The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and last 5 characters from the API response appended to the csv file. The first 248 characters of the string will always include the same redundant info, and the last 5 characters will always be the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
The solution, for this particular case, turned out to be simpler than I had expected, and thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the api response, and for exploring the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.
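A line-based alternative that does not depend on fixed character offsets could look like the sketch below. It assumes, based on the sample response in the question, that every title line starts with '*', that the column-heading line is the only '*' line containing commas, and that the response ends with an '<EOF>' line. The function name data_lines and the shortened sample are made up for illustration.

```python
def data_lines(response_text, include_header):
    """Return the useful CSV lines from one API response string."""
    kept = []
    for line in response_text.splitlines():
        if line.startswith("*"):
            # Title lines start with '*'. The column headings are assumed
            # to be the only '*' line that also contains commas.
            if include_header and "," in line:
                kept.append(line[1:])  # drop the leading asterisk
            continue
        if not line.strip() or line.strip() == "<EOF>":
            continue  # skip blank lines and the end-of-file marker
        kept.append(line)
    return kept


# Shortened stand-in for api_response.text.
sample = (
    "*\n*\n*Actual Aggregated Generation Per Type (B1620) Data\n*\n"
    "*Document Type,Business Type\n"
    "Actual generation per type,Solar generation\n"
    "Actual generation per type,Wind generation\n"
    "<EOF>\n"
)
first = data_lines(sample, include_header=True)
later = data_lines(sample, include_header=False)
print(first)
print(later)
```

The first response would be written with mode "w" and include_header=True, and later responses appended with mode "a" and include_header=False, writing "\n".join(lines) + "\n" each time. This keeps working if the length of the header text ever changes.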

How to keep both good and bad lines when loading text file?

I am trying to load a large text file into a pandas DataFrame. One thing I noticed is that, to load it successfully, I have to drop all the bad lines. But I would like to load all rows first, then take a look and clean them manually. Is there a way to do that?
data = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False, engine='python')
Here are the warnings I got. It's a common error, but all the solutions just skip the bad lines, and I really need to load all rows. Any thoughts?
Skipping XXX line: Expected 28 fields in line XXX, saw 29

Without knowing more about the specific CSV file, it looks like there is either:
Too many columns in that row (an extra comma)
Quoting is off meaning there's a comma that should be quoted but isn't
The best way to remedy this is to fix the problem in the CSV file.

Technically you're not just loading the file, but also parsing it at the same time. It looks like you've handled the delimiter properly, so as you may have guessed you have too many columns or too few in some of your rows. That may actually be the case, or perhaps you have tabs within text fields that are being interpreted as delimiters.
In any case, pandas isn't going to parse those inconsistent lines.
A typical approach is to open the file in a robust text editor and look at the lines that are erroring out in Pandas. See what's actually wrong and either fix it in the text editor, or use python's native open() function to load the entire file and iterate line by line, with logic that fixes whatever the problem is.
Once you're certain that you have the same number of columns in every row, load it with Pandas.
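The line-by-line approach can be sketched as follows: parse every row with the csv module, keep the rows with the expected field count, and set the others aside with their row numbers for manual inspection. The function name split_good_bad and the sample data are illustrative, and a tab delimiter is assumed as in the question.

```python
import csv
import io


def split_good_bad(lines, expected_fields, delimiter="\t"):
    """Separate parsed rows into those with the expected field count and the rest."""
    good, bad = [], []
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if len(row) == expected_fields:
            good.append(row)
        else:
            bad.append((row_number, row))  # keep the position for later review
    return good, bad


sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\n")
good, bad = split_good_bad(sample, expected_fields=3)
print(good)
print(bad)
```

Nothing is discarded here: the good rows can go into a DataFrame (first row as the header), while the bad rows stay available to fix by hand and add back.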

Is it possible to skip blank lines in a DataFrame? If yes, how can I do this?

I am trying to run this code
num = df_out.drop_duplicates(subset=['Name', 'No.']).groupby(['Name']).size()
But when I do I get this error:
ValueError: not enough values to unpack (expected 2, got 0)
If we think of my DataFrame (df_out) as an Excel file, it does have blank cells, but no full column or full row is blank. I need to skip the blank lines to run the code without changing the DataFrame's structure.
Is this possible?
Thank you

Consider using df.dropna(). It is used to remove rows that contain NA. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html for more information.
First, you probably want your "blank cells" converted to NA values so that dropna() can drop them. This can be done in various ways, notably df.replace(r'^\s*$', numpy.nan, regex=True), which matches cells that are empty or contain only whitespace (an unanchored pattern such as r'\s+' would also replace any cell that merely contains a space). If your "blank cells" are all empty strings, or fixed strings equal to some value s, you can directly use (first case) df.replace('', numpy.nan) or (second case) df.replace(s, numpy.nan). Import numpy directly for this, since the pandas.np alias is no longer available in recent pandas versions.
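A small sketch of this on made-up data, assuming the blank cells are empty or whitespace-only strings in the 'Name' column. The pattern is anchored so that only cells that are entirely blank are matched, and the subset argument limits dropna() to the column that matters.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one whitespace-only and one empty 'Name' cell.
df = pd.DataFrame({"Name": ["a", "  ", "b", ""], "No.": [1, 2, 3, 4]})

# Turn blank strings into NaN, then drop the rows where 'Name' is missing.
cleaned = df.replace(r"^\s*$", np.nan, regex=True).dropna(subset=["Name"])
print(cleaned)
```

The original df is left unchanged, so its structure is preserved and the cleaned copy can be used only for the groupby step.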

Using Python to manipulate csv files: vlookup from another csv, insert columns, delete rows, loop

I have 100 csv files, each containing publication data of a different institution, and I would like to perform the same manipulation on all of them:
1. Get the Institution name from cell B1. This is always after 'at' or 'at the'. For example: 'Publications at Tohoku University'.
2. Vlookup the matching InstitutionCode from another csv file called 'Codes'. For example: '1286' (for Tohoku University).
3. Delete rows 1-14 (including the Institution name in cell B1).
4. Insert two extra columns (columns A and B) into the file with the following headers: 'Institution' and 'InstitutionCode', and fill them with the relevant information for all rows where I have data (in the above example, Tohoku University and 1286).
I am new to Python and find it hard to put together this script from the resources I have found.
Can anyone please help me?
Below is image of the data in original format
Below is the image of the result required

I could give you the code, but instead, I'll explain to you how you can write it yourself.
1. Read the Codes file and store the institutions and codes in a dictionary. You can read more about reading csv files here: https://pymotw.com/2/csv/ or here: https://pymotw.com/3/csv/. Each row will be represented as a list of strings, so you can access cell elements by their index. Make the Institution names the keys and the codes the values.
2. Read the csv files one by one in a for loop. I'll call these the input files. Open a new file for writing for each input file that you read. I'll call these the output files.
3. Loop over the rows in the csv file. You can keep track of the row numbers by using enumerate. You can find info on this here for example: http://book.pythontips.com/en/latest/enumerate.html.
4. Get the contents of cell B1 by taking element 1 from row 0.
5. Find the Institution name by using a regular expression. More info here for example: http://dev.tutorialspoint.com/python/python_reg_expressions.htm. Then get the Institution code from the dictionary you made in step 1.
6. Keep looping over the rows until the first element equals 'Title'. This row contains the headers. Write "Institution" and "InstitutionCode" to the output file, followed by the headers you just found. To do this, convert your row (a list of strings) to a tuple (http://www.tutorialspoint.com/python/python_tuples.htm) and give that as an argument to the writerow method of the csv writer object (see the links in step 1).
7. Then for each row after the header row, make a tuple of the Institution name and code, followed by the information from the row you just read from the input file, and give that as an argument to the writerow method of the csv writer object.
8. Close the output file.
One thing to think about is whether you want quotes around the cell contents in the output files. You can read about this in the links in step 1. The same goes for the field delimiters. If you don't specify anything, they are assumed to be commas, but you can change this.
I hope this helps!
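For reference, the steps above could be put together roughly as in the sketch below. It makes several assumptions that the question does not confirm: the Codes file has two columns (institution name, code) and no header row, the header row of each input file starts with 'Title', and every institution name is present in Codes. The function names and the in-memory sample data are invented for illustration.

```python
import csv
import io
import re


def load_codes(codes_file):
    """Build {institution name: code} from rows of (name, code)."""
    return {row[0]: row[1] for row in csv.reader(codes_file) if len(row) >= 2}


def convert(in_file, out_file, codes):
    """Copy the table starting at the 'Title' row, prefixed with two new columns."""
    writer = csv.writer(out_file)
    name = code = None
    in_table = False
    for i, row in enumerate(csv.reader(in_file)):
        if i == 0:
            # Cell B1 is element 1 of row 0, e.g. 'Publications at Tohoku University'.
            match = re.search(r"\bat (?:the )?(.+)$", row[1])
            name = match.group(1).strip()
            code = codes[name]
        elif not in_table and row and row[0] == "Title":
            # Header row: everything before it is skipped.
            writer.writerow(["Institution", "InstitutionCode"] + row)
            in_table = True
        elif in_table and row:
            writer.writerow([name, code] + row)


codes = load_codes(io.StringIO("Tohoku University,1286\n"))
source = io.StringIO(
    "Report,Publications at Tohoku University\n"
    "some,metadata\n"
    "Title,Year\n"
    "Paper A,2015\n"
    "Paper B,2016\n"
)
result = io.StringIO()
convert(source, result, codes)
print(result.getvalue())
```

For the 100 real files, convert could be called in a loop over the input paths, opening each input and output file with newline="" as the csv module documentation recommends. The sketch raises an error if B1 does not match the pattern or the name is missing from Codes, which may be preferable to silently writing wrong codes.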
