Save dataframe as CSV in Python

I am trying to save the result of this code as a CSV file:
import pandas as pd
df = pd.DataFrame({'ID': ['a01', 'a01', 'a01', 'a01', 'a01', 'a01', 'a01', 'a01', 'a01', 'b02', 'b02','b02', 'b02', 'b02', 'b02', 'b02'],
'Row': [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 3, 3, 3],
'Col': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 3, 1, 2, 3],
'Result': ['p', 'f', 'p', 'p', 'p', 'f', 'p', 'p', 'p', 'p', 'p', 'p', 'f', 'p', 'p', 'p']})
dfs = {}
for n, g in df.groupby('ID'):
    # Keyword arguments are required by recent pandas versions
    dfs[n] = g.pivot(index='Row', columns='Col', values='Result').fillna('')
    print(f'ID: {n}')
    print(dfs[n])
    print('\n')
    print(dfs[n].stack().value_counts().to_dict())
    print('\n')
I found several methods and tried to save the output (dictionary form) into a CSV file, but without success. Any thoughts?
P.S. This is one of the methods I found, but I don't know how to name the columns based on my output:
import csv
with open("Output.csv", "w", newline="") as csv_file:
    cols = ["???????????"]
    writer = csv.DictWriter(csv_file, fieldnames=cols)
    writer.writeheader()
    writer.writerows(data)

df.to_csv('Output.csv', index = False)
For more details, see:
https://datatofish.com/export-dataframe-to-csv/
https://www.geeksforgeeks.org/saving-a-pandas-dataframe-as-a-csv/

Use the method provided by the pandas DataFrame object:
df.to_csv()

You can use df.to_csv() to convert your data to csv.
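As a rough sketch of how this could apply to the question's data: the per-ID pivot tables can be stacked with pd.concat and written with to_csv, and the value counts can be flattened into one row per ID and value. The combined layout, the column names Result and Count in the counts table, and the file names Output.csv and Counts.csv are assumptions, not something the asker specified.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a01'] * 9 + ['b02'] * 7,
    'Row': [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 3, 3, 3],
    'Col': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 3, 1, 2, 3],
    'Result': ['p', 'f', 'p', 'p', 'p', 'f', 'p', 'p', 'p',
               'p', 'p', 'p', 'f', 'p', 'p', 'p'],
})

# One pivot table per ID, as in the question
dfs = {}
for n, g in df.groupby('ID'):
    dfs[n] = g.pivot(index='Row', columns='Col', values='Result').fillna('')

# Stack the tables into one frame indexed by (ID, Row) and save it
combined = pd.concat(dfs, names=['ID'])
combined.to_csv('Output.csv')

# Flatten the per-ID value counts into one row per (ID, value)
rows = []
for n, table in dfs.items():
    value_counts = pd.Series(table.to_numpy().ravel()).value_counts()
    for value, count in value_counts.items():
        rows.append({'ID': n, 'Result': value, 'Count': count})
counts = pd.DataFrame(rows)
counts.to_csv('Counts.csv', index=False)

print(combined)
print(counts)
```

Writing each table to its own file (for example dfs[n].to_csv(f'{n}.csv') inside the loop) would be another reasonable choice if the IDs should stay separate.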

Related

Python: parallelize for loop and save results in dictionary passed by reference

I would like to parallelize the following function, where A and B are just some columns from my input data frame. I would like to pass the output dictionary as an argument so that it is filled within the function (pass by reference):
import pandas as pd
from tqdm import tqdm
outdict = {}
inputdict1 = {'id1': 100, 'id2': 200, 'id3': 0}
inputdict2 = {'id1': ['cat'], 'id2': ['dog', 'rabbit'], 'id3': []}
inputdf = pd.DataFrame({'A': ['cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'rabbit', 'rabbit'],
'B': ['a', 'b', 'b', 'c', 'c', 'd', 'e', 'f']})
def processing(outdict, inputdict1, inputdict2, inputdf):
    for key, _ in tqdm(inputdict1.items()):
        outdict[key] = inputdf[inputdf.A.isin(inputdict2[key])].B.nunique()
processing(outdict, inputdict1, inputdict2, inputdf)
print(outdict)
{'id1': 3, 'id2': 4, 'id3': 0}
After some research, I tried the following approach:
from multiprocessing import Pool
def processing(outdict, inputdict1, inputdict2, inputdf):
    for key, value in tqdm(inputdict1.items()):
        outdict[key] = inputdf[inputdf.A.isin(inputdict2[key])].B.nunique()
outdict = {}
pool = Pool()
pool.starmap(processing, zip(outdict, inputdict1, inputdict2, inputdf))
print(outdict)
{}
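One likely reason the attempt prints {}: zip stops at its shortest argument, and outdict is empty, so starmap receives no tasks at all. Even with tasks, each worker process would fill its own copy of the dictionary rather than the parent's. Below is a minimal sketch of an alternative, assuming it is acceptable to return (key, value) pairs and build the dictionary in the caller. It swaps in concurrent.futures.ThreadPoolExecutor for multiprocessing.Pool, and count_unique is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

inputdict1 = {'id1': 100, 'id2': 200, 'id3': 0}
inputdict2 = {'id1': ['cat'], 'id2': ['dog', 'rabbit'], 'id3': []}
inputdf = pd.DataFrame({
    'A': ['cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'rabbit', 'rabbit'],
    'B': ['a', 'b', 'b', 'c', 'c', 'd', 'e', 'f'],
})


def count_unique(key):
    """Return (key, number of distinct B values whose A is listed for key)."""
    mask = inputdf['A'].isin(inputdict2[key])
    return key, inputdf.loc[mask, 'B'].nunique()


# Each task returns a (key, value) pair; the dict is assembled in the caller
with ThreadPoolExecutor() as executor:
    outdict = dict(executor.map(count_unique, inputdict1))

print(outdict)
```

Returning results instead of mutating a shared dictionary keeps the same code usable with a process pool, where arguments are copied into each worker.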

Pandas dropna() not removing entire row

When I searched for a way to remove an entire column in pandas if it contains a null/NaN value, the only appropriate function I found was dropna(). For some reason, it's not removing the entire row as intended, but instead replacing the null values with zero. As I want to discard the entire row and then compute the mean age of the animals in the DataFrame, I need a way to not count the NaN values.
Here's the code:
import numpy as np
import pandas as pd
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, labels)
df.dropna(inplace= True)
df.head()
In this case, I need to delete the dog 'd' and the cat 'h'. But the output I get is:
Note that I have also tried this, and it didn't work either:
df2 = df.dropna()
To remove every column that contains a NaN, specify axis=1 and how='any'
(see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html):
df.dropna(axis=1, inplace=True, how='any')
If you just want to delete the rows that contain a NaN:
df.dropna(inplace=True, how='any')
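As a quick check on the question's data, the sketch below drops the rows with a missing age and computes the mean. The variable name cleaned is an assumption for illustration. It also shows that Series.mean skips NaN by default, so the mean age comes out the same without dropping anything.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

# Drop every row that has at least one NaN (rows 'd' and 'h' here)
cleaned = df.dropna()
print(cleaned.index.tolist())

# Mean age over the remaining rows
print(cleaned['age'].mean())

# mean() ignores NaN by default (skipna=True), so this gives the same value
print(df['age'].mean())
```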

Create a DataFrame birds from this dictionary data which has the index labels

Consider the following Python dictionary data and Python list labels:
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Create a DataFrame birds from this dictionary data which has the index labels, using pandas.
Assuming your dictionary is already in the correct order for the labels:
import numpy as np
import pandas as pd
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
data['labels'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, columns=['birds', 'age', 'visits', 'priority', 'labels'])
df = df.set_index('labels')  # set_index returns a new DataFrame, so assign the result
Try the code below:
import numpy as np
import pandas as pd
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
    'labels': labels
}
df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')  # set_index returns a new DataFrame, so assign the result
You can shorten the code:
The DataFrame constructor accepts arguments such as data, columns, and index, among others.
When you pass a dictionary, its keys become the column names and its values become the column contents by default.
In the following code I set the name attribute of the DataFrame's index:
df = pd.DataFrame(data, index=labels)  # custom index labels
df.index.name = 'labels'  # df.index.name is None by default; assigning to it names the index
I hope this is helpful.
I ran into the exact same issue a few days ago. There is also a library for handling data frames that I find better than pandas.
Search for turicreate in Python; it is very similar to pandas but has a lot more to offer.
You can define SFrames in Turi Create, which are somewhat similar to pandas DataFrames. After that you just have to run:
dataframe_name.show()
.show() visualizes any data structure in Turi Create.
You can visit the following notebook for a better understanding: https://colab.research.google.com/drive/1DIFmRjGYx0UOiZtvMi4lOZmaBMnu_VlD
You can also try this:
import pandas as pd
import numpy as np
from pandas import DataFrame
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df=DataFrame(data,index=labels)
print(df)

How do you return only a group by in pandas?

I have the following script, from which I want a simple group by:
# import the pandas module
import pandas as pd
from openpyxl import load_workbook
writer = pd.ExcelWriter(r'D:\temp\test.xlsx', engine='openpyxl')
# Create an example dataframe
raw_data = {'Date': ['2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13','2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13'],
'Portfolio': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
'Duration': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
'Yield': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],}
df = pd.DataFrame(raw_data, columns = ['Date', 'Portfolio', 'Duration', 'Yield'])
dft = df.groupby(['Date', 'Portfolio', 'Duration', 'Yield'], as_index =False)
This creates a pandas group by object.
I then want to output this to excel:
dft.to_excel(writer, 'test', index=False)
writer.save()
However it returns an error:
AttributeError: Cannot access callable attribute 'to_excel' of 'DataFrameGroupBy' objects, try using the 'apply' method
Why would I need an apply? I only want the group by results to remove duplicates.
You can indeed drop duplicates using groupby, by taking the first or the mean of each group, like:
df.groupby(['Date', 'Portfolio', 'Duration', 'Yield'], as_index=False).mean()
df.groupby(['Date', 'Portfolio', 'Duration', 'Yield'], as_index=False).first()
Note that you have to apply a function (in this case the mean or first method) to get back a DataFrame from the groupby object. This can then be written to Excel.
But as @EdChum notes, in this case using the drop_duplicates method of the DataFrame is the easier approach:
df.drop_duplicates(subset=['Date', 'Portfolio', 'Duration', 'Yield'])
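A small sketch of the drop_duplicates route on the question's data, with the repeated lists shortened by list multiplication. The names deduped and group_sizes are assumptions for illustration, and the Excel export is left as a comment because it needs openpyxl and a writable path.

```python
import pandas as pd

raw_data = {
    'Date': ['2016-05-13'] * 17,
    'Portfolio': ['A'] * 6 + ['B'] * 5 + ['C'] * 6,
    'Duration': [1] * 6 + [2] * 5 + [3] * 6,
    'Yield': [0.3] * 6 + [2] * 5 + [1] * 6,
}
df = pd.DataFrame(raw_data, columns=['Date', 'Portfolio', 'Duration', 'Yield'])

# Keep the first occurrence of each distinct row; this is a plain DataFrame
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)

# A groupby only yields a result once an aggregation is applied, e.g. size()
group_sizes = df.groupby(['Date', 'Portfolio', 'Duration', 'Yield']).size()
print(group_sizes)

# Because deduped is a DataFrame, it has to_excel, for example:
# deduped.to_excel('test.xlsx', sheet_name='test', index=False)
```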

Filter DataFrame of dtype object using pandas

I'm required to parse data using the following operations.
data=[{'a': 1,
'b': {1: 1,
2: 2},
'c': ['q', 'w', 'e', 'r', 't', 'y']},
{'a': 2,
'b': {1: 2,
2: 3},
'c': ['q', 't', 'a', 'v', 'o', 'l']}]
df = pd.DataFrame(data)
I want to get data that satisfy a condition as follows:
print(df['q' in df.c].head())
However, I get an error:
File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: False
Why wouldn't this work?
I'm confused, because the following code works, unlike the filter on the object-dtype column:
print(df[df.a == 1].head())
You can use apply on the column to generate a boolean mask describing the desired rows, and then filter the DataFrame by this mask:
>>> df[df.c.apply(lambda val: 'q' in val)]
   a             b                   c
0  1  {1: 1, 2: 2}  [q, w, e, r, t, y]
1  2  {1: 2, 2: 3}  [q, t, a, v, o, l]
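To make the difference concrete, here is a short sketch on the same data. It searches for 'w' instead of 'q' (an assumption made only so that the mask actually excludes a row) and shows why the original expression fails: the in operator applied to a Series tests its index labels, so it returns a single boolean rather than a per-row mask.

```python
import pandas as pd

data = [{'a': 1, 'b': {1: 1, 2: 2}, 'c': ['q', 'w', 'e', 'r', 't', 'y']},
        {'a': 2, 'b': {1: 2, 2: 3}, 'c': ['q', 't', 'a', 'v', 'o', 'l']}]
df = pd.DataFrame(data)

# `in` on a Series checks the index (0 and 1 here), not the values,
# so this is a single False and df[False] raises KeyError: False
print('q' in df['c'])

# Build a per-row boolean mask by testing membership inside each list
mask = df['c'].apply(lambda values: 'w' in values)
print(mask.tolist())

# Filter the rows with the mask
print(df[mask])
```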
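'q' in df.c tests membership in the index of the Series, not in its values, so it evaluates to the single boolean False, and df[False] raises KeyError: False. To test the values, you can use str.contains():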
df.c.str.contains("q", regex=False)
