For a current project, I am working with a large Pandas DataFrame sourced from a JSON file.
As soon as I call specific objects of the JSON file within Pandas, I get key errors such as KeyError: 'date' for the line df['date'] = pd.to_datetime(df['date']).
I have already ruled out the identifier/object naming as a possible source of the error. Is there any smart tweak to make this code work?
The JSON file has the following structure:
[
{"stock_symbol": "AMG", "date": "2013-01-01", "txt_main": "ABC"}
]
And the corresponding code section looks like this:
import string
import json
import pandas as pd
# Loading and normalising the input file
file = open("sp500.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df = pd.DataFrame().fillna("")
# Datetime conversion
df['date'] = pd.to_datetime(df['date'])
Take a look at the documentation examples of the fillna function.
By doing df = pd.DataFrame().fillna("") you are overwriting your previous df with a new (empty) DataFrame. You can just apply it this way: df = df.fillna("").
In [42]: import string
...: import json
...: import pandas as pd
...:
...: # Loading and normalising the input file
...: #file = open("sp500.json", "r")
...: #data = json.load(file)
...: df = pd.json_normalize(a)  # a holds the parsed JSON list from the question
...: #df = pd.DataFrame().fillna("")
...:
...: # Datetime conversion
...: df['date'] = pd.to_datetime(df['date'])
In [43]: df
Out[43]:
stock_symbol date txt_main
0 AMG 2013-01-01 ABC
df = pd.DataFrame().fillna("") creates a new, empty DataFrame and fills its NaN values with empty strings.
So, change that line to df = df.fillna("")
You are using df = pd.DataFrame().fillna(""), which creates a new, empty DataFrame and fills its missing values with empty strings.
Here the old df is replaced by that empty DataFrame, so there is no column named date. Instead, fill the 'na' values on the existing frame with df = df.fillna("").
import string
import json
import pandas as pd
# Loading and normalising the input file
file = open("sp500.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df = df.fillna("")
# Datetime conversion
df['date'] = pd.to_datetime(df['date'])
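For reference, here is a minimal self-contained version of the corrected flow, with the one-record sample from the question inlined in place of reading sp500.json:

```python
import pandas as pd

# sample record from the question, inlined instead of json.load("sp500.json")
data = [{"stock_symbol": "AMG", "date": "2013-01-01", "txt_main": "ABC"}]

df = pd.json_normalize(data)
df = df.fillna("")        # fill missing values on the existing df, not a new one
df['date'] = pd.to_datetime(df['date'])
print(df['date'].iloc[0])  # 2013-01-01 00:00:00
```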
Thank you
Related
I have a code that merges all txt files from a directory into a dataframe
follow the code below
import pandas as pd
import os
import glob
diretorio = r"F:\PROJETOS\LOTE45\ARQUIVOS\RISK\RISK_CUSTOM_FUND_N1"  # raw string avoids backslash escapes
files = []
files = [pd.read_csv(file, delimiter='\t')
for file in glob.glob(os.path.join(diretorio ,"*.txt"))]
df = pd.concat(files, ignore_index=True)
df
that gives result to this table
I needed to add a date column to this table, but I only have the date available at the end of the filename.
How can I get the date at the end of the filename and put it inside the dataframe.
I have no idea how to do this
Assuming the file naming pattern is constant, you can parse the end of the filename on every iteration of the loop this way:
from datetime import datetime
files = []
for file in glob.glob(os.path.join(diretorio, "*.txt")):
    df_f = pd.read_csv(file, delimiter='\t')
    df_f['date'] = datetime.strptime(file[-12:-4], "%d%m%Y")  # last 8 chars before ".txt", assumed DDMMYYYY
    files.append(df_f)
df = pd.concat(files, ignore_index=True)
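As a quick check of the slicing, assuming a hypothetical filename ending in _DDMMYYYY.txt, the last eight characters before the extension parse like this:

```python
from datetime import datetime

# hypothetical filename following the assumed *_DDMMYYYY.txt pattern
fname = "RISK_25092017.txt"
parsed = datetime.strptime(fname[-12:-4], "%d%m%Y")
print(parsed.date())  # 2017-09-25
```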
import pandas as pd
import os
diretorio = "F:/PROJETOS/LOTE45/ARQUIVOS/RISK/RISK_CUSTOM_FUND_N1/"
files = []
for filename in os.listdir(diretorio):
    if filename.endswith(".txt"):  # the question's files are tab-delimited .txt
        df = pd.read_csv(diretorio + filename, sep='\t')
        df['Date'] = filename.split('.')[0].split("_")[-1]
        files.append(df)
df = pd.concat(files, ignore_index=True)
print(df)
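The split-based extraction in the loop above can be sketched on a hypothetical filename (the trailing date suffix is an assumption about the naming scheme):

```python
# hypothetical filename; the date is assumed to be the last underscore-separated part
fname = "RISK_CUSTOM_FUND_20210131.txt"
date_part = fname.split('.')[0].split("_")[-1]
print(date_part)  # 20210131
```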
I have a csv file with columns that have no content, just headers. I want them to be included in the resulting DataFrame, but pandas cuts them off by default. Is there any way to solve this using read_csv rather than read_excel?
IIUC, you need header=None:
from io import StringIO
import pandas as pd
data = """
not_header_1,not_header_2
"""
df = pd.read_csv(StringIO(data), sep=',')
print(df)
OUTPUT:
Empty DataFrame
Columns: [not_header_1, not_header_2]
Index: []
Now, with header=None
df = pd.read_csv(StringIO(data), sep=',', header=None)
print(df)
OUTPUT:
0 1
0 not_header_1 not_header_2
I would like to create a pandas DataFrame out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list, rather than first saving the list to a csv and reading the csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (seperated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
[f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1 )
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
If you want the first row as your header, then:
df.columns = df.iloc[0]
df = df[1:]
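Putting both steps together as one self-contained sketch, using the test list from the question:

```python
import pandas as pd

test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1", "15\tInfo4\t674.1"]
df = pd.DataFrame([x.split('\t') for x in test])
df.columns = df.iloc[0]             # promote the first row to the header
df = df[1:].reset_index(drop=True)  # drop that row from the data
print(list(df.columns))  # ['Col1', 'Col2', 'Col3']
```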
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file, socket, or network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from the network (socket, web API) as a single string.
Can we convert the highlighted INTEGER values to STRING value (refer below link)?
https://i.stack.imgur.com/3JbLQ.png
CODE
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
import pandas as pd
df = pd.read_csv(filename)
df.to_json(jsonFileName, indent=4)
print(df)
Try doing something like this.
import pandas as pd
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
df = pd.read_csv(filename)
df['index'] = df.index
df.to_json(jsonFileName, indent=4)
print(df)
This will take indices of your data and store them in the index column, so they will become a part of your data.
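If the goal is specifically to write the highlighted integers out as strings, another option is to cast the column with astype(str) before calling to_json (the column name id here is a hypothetical stand-in for the one in the screenshot):

```python
import pandas as pd

# hypothetical data; "id" stands in for the integer column from the screenshot
df = pd.DataFrame({"id": [101, 102], "name": ["A", "B"]})
df["id"] = df["id"].astype(str)  # integers become strings in the JSON output
print(df.to_json(orient="records"))  # [{"id":"101","name":"A"},{"id":"102","name":"B"}]
```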
I keep getting the error above even though I tried to convert everything to an object or string.
df['temp'] = df['Date'].apply(lambda x: x.strftime('%m/%d/%Y'))
nd = df['Date'].unique()
nd = np.array_str
I want the unique value in the Date column of df to be the column header, shown as MM/DD/YYYY. The result in Python appears as "0x02F1E978". It should have been 09/25/2017, and I can't write the file to Excel at all.
import pandas as pd
import numpy as np
from datetime import date, datetime
path = 'C:/Users/tnguy075/Desktop/Inventory Valuation/'
file1 = 'AH_INDY_COMBINEDINV_VALUE_TIDL.xlsx'
file2 = 'DailyInventoryVal.xlsx'
df = pd.read_excel(path+file1, skiprows=1, dtype={'Valuation': np.float64}, parse_dates=['Date']) #open the daily data
df['temp'] = df['Date'].apply(lambda x: x.strftime('%m/%d/%Y'))#change it to a string in form of MM/DD/YYYY
nd = df['Date'].unique()
nd = np.array_str
df = pd.pivot_table(df,index=["Key"], values=["Valuation"],aggfunc=np.sum).reset_index()
df.columns = ['Key',nd]
dv = pd.read_excel(path+file2)
dv = dv.merge(df[['Key', nd]], how = 'left')#Merge data from file 1 using Key
dv.to_excel(path + 'DaivalReport.xlsx')
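The "0x02F1E978"-style output suggests nd ends up bound to the np.array_str function object itself (the line nd = np.array_str replaces the array of unique dates with a function reference). A minimal sketch of pulling the single unique date out as an MM/DD/YYYY string instead, with sample rows standing in for the Excel input:

```python
import pandas as pd

# sample rows standing in for the Excel file
df = pd.DataFrame({"Date": pd.to_datetime(["2017-09-25", "2017-09-25"]),
                   "Key": ["A", "B"],
                   "Valuation": [1.0, 2.0]})

# format the unique date as a string rather than rebinding nd to np.array_str
nd = df["Date"].dt.strftime("%m/%d/%Y").unique()[0]
print(nd)  # 09/25/2017
```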