Manipulating API data in Python before appending to a csv file - python
I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows in all (the data as it appears after a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so as to allow me to only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python 'print' command (the first asterisk is row 1; the fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>
I interpret your problem as not being able to append the newly fetched data to the CSV file. I'm not sure whether you are using a module that helps with CSV files, so for now I'll assume you aren't.
If you open the file with "a" as the second argument to open, like f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row and later append a new one, but I don't think that's the problem.
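For example, a rough sketch of that idea (assuming the response text is in a variable like api_response.text, the file already has its column headings, and the layout is always 4 banner rows, 1 column-heading row, 11 data rows, then an <EOF> row):

# Rough sketch: keep only the 11 data rows from a new response and append them
lines = api_response.text.splitlines()
data_rows = lines[5:-1]                      # drop banner/heading rows and the <EOF> row
with open("file.csv", "a") as f:
    for row in data_rows:
        f.write(row + "\n")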
Hope I could help you, and please tell me whether I understood correctly what your problem is here :)
Btw, I would recommend looking into a CSV module for this or something like sqlite3.
The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and the last 5 characters of the API response before appending to the csv file. The first 248 characters of the string always contain the same redundant info, and the last 5 characters are always the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
The solution, for this particular case, turned out to be simpler than I had expected. Thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the API response, and for prompting me to explore the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.
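One less position-dependent variation would be to slice the response by lines rather than by fixed character offsets; this is only a sketch and assumes the layout shown above (4 banner rows, 1 heading row, 11 data rows, <EOF>):

# Sketch: line-based slicing instead of fixed character offsets
lines = api_response.text.splitlines()

# First query: write the column headings plus the 11 data rows
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write("\n".join(lines[4:-1]) + "\n")

# Subsequent queries: append the 11 data rows only
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write("\n".join(lines[5:-1]) + "\n")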
Related
Extraction of complete rows from CSV using list, we don't know row indices
Can somebody help me in solving the problem below? I have a CSV which is relatively large, with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs. Note: I don't know the row indices of the required case IDs. Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558]. We need to extract the complete rows for those IDs. Best Regards
If your Case ID column is present as a header in the csv file, you can use pandas read_csv to load it into a dataframe, set the index to that column, and extract rows based on the faulty IDs. For more info, attach sample data of the csv.
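A rough sketch of that pandas approach (the column name 'Case ID', the output file name, and reading in chunks are assumptions; with 4000 columns the whole file may not fit in memory at once):

import pandas as pd

faulty_ids = [50055, 72525, 82998, 1555558]
matches = []

# Read the large CSV in chunks and keep only rows whose Case ID is in the list
for chunk in pd.read_csv("production_data.csv", chunksize=100_000):
    matches.append(chunk[chunk["Case ID"].isin(faulty_ids)])

result = pd.concat(matches)
result.to_csv("faulty_rows.csv", index=False)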
"CSV, which is relatively large with over 1 million rows X 4000 columns"

As CSVs are just text files, and this one is probably too big to load as a whole, I suggest using the built-in fileinput module. If the ID is the 1st column, create extractfaults.py as follows:

import fileinput

faulty = ["50055", "72525", "82998", "1555558"]

for line in fileinput.input():
    if fileinput.lineno() == 1:               # the first line read is the header
        print(line, end='')
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')

and use it the following way:

python extractfaults.py data.csv > faultdata.csv

Explanation: keep lines which are either the 1st line (the header) or have one of the provided IDs (I used the optional 2nd .split argument to limit the number of splits to 1). Note the usage of end='' as fileinput keeps the original newlines. My solution assumes that the IDs are not quoted and the ID is the first column; if either of these does not hold, feel free to adjust the code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data into the database first and then easily operate on rows and columns. A flat file is not the best solution in your case, since you would need to load all the data from the file into memory to process it, and opening such a large file takes a lot of time as well.
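If a full database server feels heavy, Python's built-in sqlite3 supports the same workflow; a rough sketch, assuming Case ID is the first column and storing the rest of each row as raw text (a 4000-column table is awkward in most databases):

import csv
import sqlite3

conn = sqlite3.connect("production_data.db")
conn.execute("CREATE TABLE IF NOT EXISTS cases (case_id TEXT, raw_row TEXT)")

# Index only the Case ID; keep the rest of the row as raw text
with open("production_data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany(
        "INSERT INTO cases VALUES (?, ?)",
        ((row[0], ",".join(row)) for row in reader),  # assumes Case ID is the first column
    )

conn.execute("CREATE INDEX IF NOT EXISTS idx_case_id ON cases (case_id)")
conn.commit()

faulty_ids = ["50055", "72525", "82998", "1555558"]
marks = ", ".join("?" for _ in faulty_ids)
for case_id, raw_row in conn.execute(f"SELECT * FROM cases WHERE case_id IN ({marks})", faulty_ids):
    print(raw_row)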
subsetting very large files - python methods for optimal performance
I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line. I need a new file (index2) that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1, except it only contains IDs from read1.

I am trying to implement the methods I've read here, but I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the 3 lines associated with each ID. 2) Unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.

Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM, so I think holding the data in RAM for searching is the most efficient approach? I'm really not sure.

Here is a sample of what index1 looks like (IDs start with #M00347):

#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB

read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 can fit in memory, the best approach is to do a single scan of this file and store all the data in a dictionary like this:

{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0": ["CCTAAGGTTCGG", "+", "CDDDDFFFFFCB"],
 "#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0": ["CGCCATGCATCC", "+", "BBCCBBFFFFFF"],
 ...}

Values can be stored as formatted strings if you prefer. After this, you can do a single scan over read1, and when an ID is encountered you can do a simple dictionary lookup to retrieve the needed data.
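A rough sketch of that two-pass idea (file names are taken from the question, and the 4-line record layout shown above is assumed):

# Build a dict mapping each ID in index1 to its following 3 lines
index = {}
with open("index1") as f:
    lines = [line.rstrip("\n") for line in f]
for i in range(0, len(lines), 4):
    index[lines[i]] = lines[i + 1:i + 4]

# Scan read1 and write out matching records in index1's format
with open("read1") as r, open("index2", "w") as out:
    read_lines = [line.rstrip("\n") for line in r]
    for i in range(0, len(read_lines), 4):
        record_id = read_lines[i]
        if record_id in index:
            out.write(record_id + "\n")
            out.write("\n".join(index[record_id]) + "\n")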
Some Empty rows are not visible in pandas dataframe
Background: I am reading a large CSV file that contains over 40K+ rows, so it required many modifications to the data, which I did without any issues because I have been using pandas for the last few months.

Issue: The CSV file contains many empty rows in between, which contain only a single type of hidden character, the EOL. These rows with the hidden EOL character are the ones ignored by pandas. I tried to share the sample data here, but the hidden character gets removed, so I am sharing a snapshot where it shows the hidden character (this is the website I used to get the above info: dostring.com/show-hidden-characters). I went through pretty popular questions in this forum but nothing helped me, so please suggest some other solutions.

Here's how I came to know some empty rows are not visible in the DF:

1. Checking the empty values count with df.example_column.isnull().sum() gave some count.
2. Converting the column datatype with df.example_column = df.example_column.astype('str') marks the empty cells as 'nan'.
3. Checking the isnull().sum() count again, it is now zero.
4. Finally, I took the data out in CSV format but still saw some empty rows, which is weird.
5. Then I used the following command to see the rows at run time: df[165:175]. It surprised me again: rows 168 and 169 are empty when you open the file in MS Office, but here in the console I can only see one empty row, 169, which is marked by pandas as 'nan'; at the same time row 168 is replaced by the data of row 167.

This scenario exists throughout the whole sheet (CSV): pandas is just ignoring one of the empty rows at run time, but in MS Office you can see both of them.

FYI, here are the settings which I am using while reading the CSV file: sep=",", skipinitialspace=False, skip_blank_lines=False, encoding='utf-8'
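A small diagnostic sketch along these lines (the file name data.csv is a placeholder; it reads with the same settings as above and then looks for rows and raw lines that are effectively empty):

import pandas as pd

# Load with the same settings described above
df = pd.read_csv("data.csv", sep=",", skipinitialspace=False,
                 skip_blank_lines=False, encoding='utf-8')

# Rows where every column is empty (the "invisible" blank rows)
blank_mask = df.isnull().all(axis=1)
print("blank rows at positions:", df.index[blank_mask].tolist())

# Inspect the raw lines to see the stray line endings (e.g. a lone '\r')
with open("data.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if raw.strip(b"\r\n ,") == b"":
            print(lineno, repr(raw))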
Using Python to manipulate csv files: vlookup from another csv, insert columns, delete rows, loop
I have 100 csv files, each containing publication data for a different institution, and I would like to perform the same manipulation on all of them:

1. Get the Institution name from cell B1. This is always after 'at' or 'at the'. For example 'Publications at Tohoku University'.
2. Vlookup the matching InstitutionCode from another csv file called 'Codes'. For example '1286' (for Tohoku University).
3. Delete rows 1-14 (including the Institution name in cell B1).
4. Insert two extra columns (columns A and B) into the file with the following headers: 'Institution' and 'InstitutionCode', and fill them with the relevant information for all rows where I have data (in the above example, Tohoku University and 1286).

I am new to Python and find it hard to put together this script from the resources I have found. Can anyone please help me? Below is an image of the data in its original format, followed by an image of the result required.
I could give you the code, but instead, I'll explain how you can write it yourself.

1. Read the Codes file and store the institutions and codes in a dictionary. You can read more about reading csv files here: https://pymotw.com/2/csv/ or here: https://pymotw.com/3/csv/. Each row will be represented as a list of strings, so you can access cell elements by their index. Make the Institution names the keys and the codes the values.
2. Read the csv files one by one in a for loop. I'll call these the input files. Open a new file for writing for each input file that you read. I'll call these the output files.
3. Loop over the rows in the csv file. You can keep track of the row numbers by using enumerate. You can find info on this here, for example: http://book.pythontips.com/en/latest/enumerate.html.
4. Get the contents of cell B1 by taking element 1 from row 0. Find the Institution name by using a regular expression. More info here, for example: http://dev.tutorialspoint.com/python/python_reg_expressions.htm. Then get the Institution code from the dictionary you made in step 1.
5. Keep looping over the rows until the first element equals 'Title'. This row contains the headers. Write "Institution" and "InstitutionCode" to the output file, followed by the headers you just found. To do this, convert your row (a list of strings) to a tuple (http://www.tutorialspoint.com/python/python_tuples.htm) and give that as an argument to the writerow method of the csv writer object (see the links in step 1).
6. Then, for each row after the header row, make a tuple of the Institution name and code, followed by the information from the row of the input file you just read, and give that as an argument to the writerow method of the csv writer object.
7. Close the output file.

One thing to think about is whether you want quotes around the cell contents in the output files. You can read about this in the links in step 1. The same goes for the field delimiters; if you don't specify anything, they are assumed to be commas, but you can change this. I hope this helps!
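A rough sketch of those steps, under several assumptions not in the question (the Codes file has names in the first column and codes in the second, the data header row starts with 'Title', and the input files match publications_*.csv):

import csv
import glob
import re

# Step 1: read the Codes file into a dictionary {institution name: code}
codes = {}
with open("Codes.csv", newline="") as f:
    for row in csv.reader(f):
        codes[row[0]] = row[1]

# Step 2: process each input file in turn
for path in glob.glob("publications_*.csv"):
    with open(path, newline="") as infile, \
         open(path.replace(".csv", "_out.csv"), "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        institution = code = None
        in_data = False
        for i, row in enumerate(reader):
            if i == 0:
                # Cell B1: institution name follows 'at' or 'at the'
                match = re.search(r"\bat (?:the )?(.+)", row[1])
                institution = match.group(1) if match else ""
                code = codes.get(institution, "")
            elif not in_data and row and row[0] == "Title":
                # Header row: prepend the two new column headers
                writer.writerow(["Institution", "InstitutionCode"] + row)
                in_data = True
            elif in_data and row:
                writer.writerow([institution, code] + row)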
Huge text file to small excel files
I have a huge text file (4 GB), where each "line" is of the syntax: [number] [number]_[number] [Text]. For example:

123 12_14 Text 1
1234 13_456 Text 2
33 12_12 Text 3
24 678_10 Text 4

My purpose is to have this data saved as an Excel file, where each "line" in the text file is a row in the Excel file. Following the example above:

[A1] 123  [B1] 12_14  [C1] Text 1
[A2] 1234 [B2] 13_456 [C2] Text 2
[A3] 33   [B3] 12_12  [C3] Text 3
[A4] 24   [B4] 678_10 [C4] Text 4

My plan is to iterate over the text "lines", as advised here, separate the "lines", and save them to cells in an Excel file. Because of the size of the text, I thought to create many small Excel files, which all together will be equal to the text file. Then I need to analyze the small Excel files, mainly to find terms that were mentioned in the [Text] cells and count the number of appearances, related to the [number] cells (representing a post and the ID of a post). Finally, I need to sum all this data in an Excel file. I'm considering the best way to create and analyze the Excel files. As mentioned here, the main libraries are xlrd and csv.
"I'm pretty sure I don't have other options than small excel files, but what will be the another approach?" Your huge text file is a type of database, although an inconvenient one. A bunch of small Excel files are another, even less convenient representation of the same database. I assume you are looking to make a bunch of small files because Excel has an upper limit on how many rows it can contain (65'000 or 1'000'000 depending on the version of Excel). However, as has been noted, Excel files are truly horrible database stores. Since you are already using Python, use module sqlite3, it's already built in and it's a real database, and it can handle more than a million rows. And it's fast. But I wanted to get an idea how fast it is with data on the scale that you propose so I created a 30M row database of roughly the same complexity as your dataset. The schema is simple: create table words (id integer primary key autoincrement not null, first text, second text, third text); and populated it with random trigrams drawn from /usr/dict/words (I have a module for generating test data like this which makes entries that look like sqlite> select * from words limit 5; 1|salvation|prorates|bird 2|fore|embellishment|empathized 3|scalier|trinity|graze 4|exes|archways|interrelationships 5|voguish|collating|partying but a simple query for a row I knew was near the end took longer than I'd hoped: select * from words where first == "dole" and second == "licked"; 29599672|dole|licked|hates took about 7 seconds on a pretty average 3-year-old desktop so I added a couple of indexes create index first on words (first); create index second on words (second); which did double the size of the database file from 1.1GB to 2.3GB but brought the simple query time down to a rather reasonable 0.006 second. I don't think you'll do as well with Excel. So parse your data however you must, but then put it in a real database.
What is the issue with just looping through the file line by line? If you have your heart set on Excel, I would recommend openpyxl.
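A minimal openpyxl sketch of that line-by-line approach, splitting the output into a new workbook every million rows (file names, the row limit, and well-formed input lines are assumptions):

from openpyxl import Workbook

MAX_ROWS = 1_000_000  # stay under Excel's per-sheet row limit
part = 1
row_count = 0
wb = Workbook()
ws = wb.active

with open("huge_file.txt", encoding="utf-8") as f:
    for line in f:
        if row_count >= MAX_ROWS:
            wb.save(f"output_part{part}.xlsx")
            part += 1
            wb = Workbook()
            ws = wb.active
            row_count = 0
        # Assumes each line is "number number_number text"
        number, pair, text = line.rstrip("\n").split(" ", 2)
        ws.append([number, pair, text])
        row_count += 1

wb.save(f"output_part{part}.xlsx")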