CSV dialect in pandas DataFrame to_csv (python)

I'm happy to use csv.Dialect objects for reading and writing CSV files in Python. My only problem with this now is the following:
it seems like I can't use them as a to_csv parameter in pandas
the to_csv (and read_csv) parameters differ from the Dialect attributes (e.g. to_csv has sep instead of delimiter)... so generating a key-value parameter list automatically doesn't seem to be a good idea
So I'm a little lost here about what to do.
What can I do if I have a dialect specified but a pandas.DataFrame I have to write to CSV? Should I create a parameter mapping by hand?! Should I switch from to_csv to something else?
I have pandas-0.13.0.
Note: to_csv(csv.writer(..., dialect=...), ...) didn't work:
need string or buffer, _csv.writer found

If you have a CSV reader, then you don't need to also do a pandas.read_csv call. You can create a DataFrame from a dictionary, so your code would look something like:
csv_dict = ...  # insert dialect-aware reading code here to build a dictionary of the format {'Header_one': [1, 2, 3], 'Header_two': [4, 5, 6]}
df = pd.DataFrame(csv_dict)
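A minimal sketch of that idea (Python 3 shown; the dialect parameters and file name below are placeholders, so substitute your own csv.Dialect):
import csv
import pandas as pd
# hypothetical dialect registration; use your existing dialect instead
csv.register_dialect("mydialect", delimiter=";", quotechar='"')
columns = {}
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f, dialect="mydialect")  # assumes the file has a header row
    for row in reader:
        for header, value in row.items():
            columns.setdefault(header, []).append(value)
df = pd.DataFrame(columns)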

Related

Renaming the columns in Vaex

I tried to read a csv file of 4 GB initially with pandas pd.read_csv, but my system is running out of memory (I guess) and the kernel is restarting or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the csv file as the header (column names, to be precise) and I'm unable to change the column names. I tried finding a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv, and somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
So, reading the pandas documentation: with header='infer' (which is the default) the csv reader automatically infers the column names, meaning the 1st row of the file is used as the header. To avoid that, pass header=None, and optionally supply the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex together with the convert and chunk_size arguments.
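A sketch of that combination; the column names below are made up, and since from_csv wraps read_csv, the extra kwargs are forwarded to pandas:
import vaex
df = vaex.from_csv(
    'Wager-Win_April-Jul.csv',
    header=None,  # don't treat the 1st record as column names
    names=['wager_id', 'user_id', 'amount'],  # hypothetical names
    convert=True,
    chunk_size=5000000,
)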

Pandas to_excel as variable (without destination file) [duplicate]

This question already has an answer here: Pandas XLSWriter - return instead of write.
I recently had to take a dataframe and prepare it to output to an Excel file. However, I didn't want to save it to the local system, but rather pass the prepared data to a separate function that saves to the cloud based on a URI. After searching through a number of ExcelWriter examples, I couldn't find what I was looking for.
The goal is to take the dataframe, e.g.:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
And temporarily store it as bytes in a variable, e.g.:
processed_data = <bytes representing the excel output>
The solution I came up with is provided in the answers and hopefully will help someone else. Would love to see others' solutions as well!
Update #2 - Example Use Case
In my case, I created an io module that allows you to use URIs to specify different cloud destinations. For example, "paths" starting with gs:// get sent to Google Storage (using gsutil-like syntax). I process the data as my first step, and then pass that processed data to a "save" function, which itself filters to determine the right path.
df.to_csv() actually works with no path and automatically returns a string (at least in recent versions), so this is my solution to allow to_excel() to do the same.
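For example, on a reasonably recent pandas:
csv_string = df.to_csv()  # no path given, so a str is returned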
Works like the common examples, but instead of specifying the file in ExcelWriter, it uses the standard library's BytesIO to store in a variable (processed_data):
from io import BytesIO
import pandas as pd
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})
output = BytesIO()  # in-memory buffer instead of a file on disk
writer = pd.ExcelWriter(output)
df.to_excel(writer)  # plus any **kwargs
writer.save()  # on pandas >= 2.0, use writer.close() instead
processed_data = output.getvalue()  # the raw xlsx bytes
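On newer pandas, where save() is gone, the same idea works with the writer as a context manager; a sketch:
with pd.ExcelWriter(output) as writer:
    df.to_excel(writer)
processed_data = output.getvalue()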

How to store complex csv data in django?

I am working on a django project where a user can upload a csv file that is stored in the database. In most of the csv files I've seen, the 1st row contains the header and the values sit underneath, but in my case the headers are in a column (one header per row, with the values beside them).
I did not understand how to save this type of data in my django model.
You can transpose your data. I think that is more appropriate for your dataset in order to do real analysis. Usually things such as id values would be the row index, and names such as company_id, company_name, etc. would be the columns. This will allow you to do further analysis (mean, std, variances, pct_change, group_by) and use pandas to its fullest. That said:
import pandas as pd
df = pd.read_csv('yourcsvfile.csv')
df2 = df.T  # transpose: rows become columns and vice versa
Also, as @H.E. Lee pointed out, in order to save your model to your database, you can either use the DataFrame's to_sql method to save to MySQL (through your connection), use to_json and then import the data if you're using MongoDB, or write your own transformation into your database manually.
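A minimal sketch of the to_sql route, assuming a SQLAlchemy engine (the credentials and table name are placeholders):
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # hypothetical connection
df2.to_sql("companies", con=engine, if_exists="replace")  # df2 is the transposed frame from above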
You can flip it with the built-in csv module quite easily, no need for cumbersome modules like pandas (which in turn requires NumPy...). Since you didn't specify the Python version you're using, and this procedure differs slightly between versions, I'll assume Python 3.x:
import csv
# open("file.csv", "rb") in Python 2.x
with open("file.csv", "r", newline="") as f:  # open the file for reading
    data = list(map(list, zip(*csv.reader(f))))  # read the CSV and flip it
If you're using Python 2.x you should also use itertools.izip() instead of zip() and you don't have to turn the map() output into a list (it already is).
Also, if the rows are uneven in your CSV you might want to use itertools.zip_longest() (itertools.izip_longest() in Python 2.x) instead.
Either way, this will give you a 2D list data where the first element is your header and the rest of them are the related data. What you plan to do from there depends purely on your DB... If you want to deal with the data only, just skip the first element of data when iterating and you're done.
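For example, a small sketch of that last step (the names are illustrative):
header, *rows = data  # the first flipped row holds the former header column
for row in rows:
    record = dict(zip(header, row))  # map each value back to its field name
    # ... save `record` to the DB here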
Given your data it may be best to store each row as a string entry using TextField. That way you can be sure not to lose any structure going forward.
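A minimal sketch of that approach (the model and field names are hypothetical):
from django.db import models
class CsvRow(models.Model):
    content = models.TextField()  # one raw CSV row per record, so no structure is lost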

Sorting tables xlsxwriter for python

I have a calculation that creates an excel spreadsheet using xlsxwriter to show results. It would be useful to sort the table after knowing the results.
One solution would be to create a separate data structure in python, sort that data structure, and use xlsxwriter afterwards, but it is not very elegant and requires a lot of data type handling.
I cannot find a way to sort the structures in the xlsxwriter module.
Can anybody help with the internal data structure of that module? Can it be sorted before writing it to disk?
Another solution would be to reopen the file, sort the contents and close it again?
import xlsxwriter
workbook = xlsxwriter.Workbook("Trial.xlsx")  # note: Workbook(), not xlsxwriter(...)
worksheet = workbook.add_worksheet("first")
worksheet.write_number(0, 1, 2)
worksheet.write_number(0, 2, 1)
# ...worksheet.sort  <- what I'd like, but no such method exists
Can anybody help with the internal data structure of that module? Can it be sorted before writing it to disk?
I am the author of the module and the short answer is that this can't or shouldn't be done.
It is possible to sort worksheet data in Excel at runtime but that isn't part of the file specification so it can't be done with XlsxWriter.
One solution would be to create a separate data structure in python, sort that data structure, and use xlsxwriter afterwards, but it is not very elegant and requires a lot of data type handling.
That sounds like a reasonable solution to me.
You should process your data before writing it to a Workbook as it is not easily possible to manipulate the data once in the spreadsheet.
The following example would write a column of numbers unsorted:
import xlsxwriter
with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")
    data = [5, 2, 7, 3, 8, 1]
    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
But if you first sort the data as follows:
import xlsxwriter
with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")
    data = sorted([5, 2, 7, 3, 8, 1])
    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
You would get the numbers written to the worksheet in ascending order.
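The same idea extends to table-like data: sort whole rows by a key column before writing. A sketch (the data and key column are made up):
import xlsxwriter
rows = [("banana", 2), ("apple", 3), ("cherry", 1)]
rows.sort(key=lambda row: row[1])  # sort by the numeric column
with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")
    for rowy, row in enumerate(rows):
        worksheet.write_row(rowy, 0, row)  # write the whole sorted row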

Turning a huge TextEdit file of JSON into a Pandas dataframe

I have an extremely large list of JSON objects in the form of a TextEdit document, each of which has 6 key-value pairs.
I would like to turn each key-value pair into a column name for a Pandas Dataframe, and list the values under the column.
{'column1': "stuff stuff", 'column2': "details details", ...}
Is there a standard way to do this?
I think you could begin by loading the file into a dataframe with
import pandas as pd
df = pd.read_table(file_name)
I think each column could be created by iterating through each JSON document using groupby.
EDIT: I think the correct approach is to parse each JSON object into a DataFrame, and then create a function to iterate through all the JSONs and build one combined DataFrame.
Take a look at read_json or json_normalize. You would indeed most likely read each file and then use for instance pd.concat to combine them as required.
Something along the below lines should work, depending on what your file looks like (here assuming that each json dictionary makes up a line in the file):
import pandas as pd
df = pd.DataFrame()
with open('workfile', 'r') as f:
    for line in f:
        # each line is one JSON object; parse it and append it to df
        df = pd.concat([df, pd.read_json(line, orient='columns')])
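Alternatively, if the file really is one JSON object per line, read_json can parse the whole file in a single call; a sketch:
import pandas as pd
df = pd.read_json('workfile', lines=True)  # each JSON line becomes one row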
