Pandas itertuples are not named tuples as expected?

Using this page from the Pandas documentation, I wanted to read a CSV into a dataframe, and then turn that dataframe into a list of named tuples.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html?highlight=itertuples
I ran the code below...

import pandas as pd

def csv_to_tup_list(filename):
    df = pd.read_csv(filename, sep=",")
    df.columns = ["term", "code"]
    tup_list = []
    for row in df.itertuples(index=False, name="Synonym"):
        tup_list.append(row)
    return tup_list

test = csv_to_tup_list("test.csv")
type(test[0])
... and the type returned is pandas.core.frame.Synonym, not a named tuple. Is this how it is supposed to work, or am I doing something wrong?
My CSV data is just two columns of data:
a,1
b,2
c,3
for example.

"Named tuple" is not a type. namedtuple is a type factory. pandas.core.frame.Synonym is the type it created for this call, using the name you picked:
for row in df.itertuples(index=False, name="Synonym"):
# ^^^^^^^^^^^^^^
This is expected behavior.
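A quick way to see this for yourself: namedtuple builds and returns a brand-new class whose name is whatever string you pass it, and every instance is still a tuple underneath. A minimal standalone sketch (independent of the dataframe above):

from collections import namedtuple

# namedtuple is a factory: it builds and returns a new class named "Synonym".
Synonym = namedtuple("Synonym", ["term", "code"])
row = Synonym(term="a", code=1)

print(type(row))               # <class '__main__.Synonym'>
print(isinstance(row, tuple))  # True: named tuples subclass tuple
print(row._fields)             # ('term', 'code')
print(row.term, row.code)      # a 1

The rows produced by itertuples behave the same way: test[0]._fields and test[0].term both work, even though type(test[0]) prints the pandas-internal module path.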

Related

How to read a json data into a dataframe using pandas

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the JSON file and make a dataframe in which each key-value pair becomes a row, with the headers "Sentence" and "Label". I tried using lines = True, but it returns all the key-value pairs in one row:
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
You can use:

import json
import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = (pd.DataFrame.from_dict(data, orient='index')
        .reset_index()
        .rename(columns={'index': 'Sentence', 0: 'Label'}))
Easy way that I remember:

import pandas as pd
import json

with open("./data.json", "r") as f:
    data = json.load(f)

df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json

To read straight from the file using read_json, you can use something like:

pd.read_json("./data.json", lines=True) \
    .T \
    .reset_index() \
    .rename(columns={"index": "Sentence", 0: "Label"})
Explanation

A little dirty, but as you probably noticed, lines=True isn't sufficient on its own: it reads the whole object as a single row. Transposing that result gives

            0
Text1       4
Text2       1
TextN     123

Resetting the index then moves the keys into a column named "index", and the final rename produces the "Sentence" and "Label" headers.
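A self-contained way to check the whole pipeline, using io.StringIO as a stand-in for the file (the data is the single-line object from the question):

import pandas as pd
from io import StringIO

raw = '{"Text1": 4, "Text2": 1, "TextN": 123}'  # stand-in for ./data.json

df = (pd.read_json(StringIO(raw), lines=True)
        .T
        .reset_index()
        .rename(columns={"index": "Sentence", 0: "Label"}))
print(df)
#   Sentence  Label
# 0    Text1      4
# 1    Text2      1
# 2    TextN    123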

Use string literal instead of header name in Pandas csv file manipulation

Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now let's say the type column doesn't even exist in the csv file, and I want to use a literal string as the type value instead of values from a column. For example, I want to create a nested object called org using the values from the location column as before, but with the string foo as the value of a key called type for every row. How can I accomplish this?
You could just build it by hand:

csv_file['org'] = csv_file['location'].apply(
    lambda x: {'location': x, 'type': 'foo'})
Use ChainMap. This allows you to use multiple columns (columns_to_use) and even override existing ones (if type is among those columns, it will be overridden):

from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
    lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
By the way, without the added constant value this could be done with DataFrame.to_dict:

csv_file['org'] = csv_file[['location', 'type']].to_dict('records')
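A runnable sketch of the ChainMap approach on made-up data; wrapping the result in dict() makes the column hold plain dicts instead of ChainMap objects:

import pandas as pd
from collections import ChainMap

# Made-up frame standing in for the CSV from the question.
csv_file = pd.DataFrame({"location": ["NY", "SF"], "type": ["a", "b"]})
columns_to_use = ["location", "type"]

# {'type': 'foo'} comes first in the ChainMap, so it wins over the column value.
csv_file["org"] = csv_file[columns_to_use].apply(
    lambda s: dict(ChainMap({"type": "foo"}, s.to_dict())), axis=1)

print(csv_file["org"].tolist())
# [{'location': 'NY', 'type': 'foo'}, {'location': 'SF', 'type': 'foo'}]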

Converting a string representation of dicts to an actual dict

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here (Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast

with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)

pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list:

for key, value in parsed.items():
    parsed[key] = [value]

df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe (note that DataFrame.append returns a new frame, so you must assign the result back):

df = pd.DataFrame()
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    for key, value in parsed.items():
        parsed[key] = [value]
    df = df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary, so you can make a one-row dataframe from each line and then join all the frames together:

df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if type(parsed) != dict:
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat once on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
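Since pd.DataFrame also accepts a list of dicts directly (keys become columns, and missing keys are filled with NaN), the loop can be collapsed further; a sketch under the same assumptions about the file:

import ast
import pandas as pd

with open('file_name.csv') as f:
    records = [ast.literal_eval(line.rstrip()) for line in f if line.strip()]

df = pd.DataFrame(records)  # one row per dict, NaN where a key is absent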

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

# Program to combine data from 2 csv files
The cdc_list gets updated after the second call of read_csv:
overall_list = []

def read_csv(filename):
    file_read = open(filename, "r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return overall_list
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652
total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131
print(len(cdc_list)) #9131
The code you pasted does explain the issue: read_csv appends to the module-level overall_list and returns that same list, so cdc_list and total_list end up as two names for one object. The second call appends the SSA rows onto the list that cdc_list already points to, which is why both lengths become 9131. The fix is to build a list local to the function (your commented-out final_list) and return that instead.
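A minimal sketch of that fix, keeping the parsing logic but using a local list so each call starts fresh (it also drops the extra int_fields nesting, so each element is one row of ints):

def read_csv(filename):
    final_list = []  # local to the function: a new list on every call
    with open(filename, "r") as f:
        rows = f.read().split("\n")[1:]  # skip the header row
    for item in rows:
        if item:  # guard against a trailing empty line
            final_list.append([int(x) for x in item.split(",")])
    return final_list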
However, if all you want to do is merge two csvs (assuming they both have the same columns), you can use Pandas' read_csv and the DataFrame methods append and to_csv to do this in three lines of code (not counting imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read and append the 2nd CSV file to the same DataFrame object
df = df.append( pd.read_csv("second.csv") )
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")

Pandas naming multiple sheets according to list of lists

This is my first post here and I am rather a newbie.
Anyhow, I have been playing around with Pandas and Numpy to make some calculations from Excel.
Now I want to create an .xlsx file to which I can output my results, and I want each sheet to be named after the dataframe being written to it.
This is my code; I tried a couple of different solutions but I can't figure out how to write it.
As you can see, save_excel just makes numbered sheets (and it works great), while save_excelB tries to do what I described, but I can't get it to work.
from generate import A, b, L, dr, dx
from pandas import DataFrame as df
from pandas import ExcelWriter as ew

A = df(A)  # turning numpy arrays into dataframes
b = df(b)
L = df(L)
dr = df(dr)
dx = df(dx)

C = [A, b, L, dr, dx]  # making a list of the dataframes to iterate through

def save_excel(filename, item):
    w = ew(filename)
    for n, i in enumerate(item):
        i.to_excel(w, "sheet%s" % n, index=False, header=False)
    w.save()

def save_excelB(filename, item):
    w = ew(filename)
    for name in item:
        i = globals()[name]
        i.to_excel(w, sheet_name=name, index=False, header=False)
    w.save()
I run both the same way: I call the function with the file name, and for item I pass the list C I made. So it would be:

save_excelB("file.xlsx", C)

and this is what I get:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
You need to pass string literals of the data frame names to your function, not the actual data frame objects:

C = ['A', 'b', 'L', 'dr', 'dx']

def save_excelB(filename, item):
    w = ew(filename)
    for name in item:
        i = globals()[name]
        i.to_excel(w, sheet_name=name, index=False, header=False)
    w.save()

save_excelB("file.xlsx", C)
You can even build C dynamically from every dataframe currently in the global environment by checking which items are pandas DataFrame instances:

import pandas as pd
...
C = [i for i in globals() if type(globals()[i]) is pd.core.frame.DataFrame]
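An alternative that sidesteps globals() entirely is to keep the name-to-frame mapping in a dict. A sketch with made-up frames standing in for the arrays from generate (recent pandas versions also prefer using ExcelWriter as a context manager rather than calling save()):

import pandas as pd

# Made-up frames standing in for the converted numpy arrays.
frames = {"A": pd.DataFrame([[1, 2]]), "b": pd.DataFrame([[3, 4]])}

with pd.ExcelWriter("file.xlsx") as w:  # needs an engine such as openpyxl
    for name, frame in frames.items():
        frame.to_excel(w, sheet_name=name, index=False, header=False)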
