I have downloaded the har file of an interactive chart and have the datapoints in the following format:
'{"x":"2022-03-28T00:00:00Z"', '"value":0.2615}',
'{"x":"2022-03-29T00:00:00Z"', '"value":0.2573}',
'{"x":"2022-03-30T00:00:00Z"', '"value":0.272}', ...
What would be the easiest way to convert this into a pandas dataframe?
Both the date and the value should be columns of the dataframe.
The first problem is that every element is wrapped in ' ', so it gets treated as two items/columns when it should be a single item/dictionary. You may need to replace the separator "', '" with a plain "," to get a normal JSON string, which you can then convert to a Python dictionary using the json module:
text = open(filename).read()
text = text.replace("', '", ",")
and later you can use io.StringIO() to load it from the text. read_csv needs quotechar="'" to read it correctly:
df = pd.read_csv(io.StringIO(text), names=['data', 'other'], quotechar="'")
Next you can convert every JSON string to a Python dictionary:
df['data'] = df['data'].apply(json.loads)
and then convert the dictionary to a pd.Series, which you can split into columns:
df[['x','value']] = df['data'].apply(pd.Series)
Finally you may remove the columns data and other:
del df['data']
del df['other']
Full working example
text = """'{"x":"2022-03-28T00:00:00Z"', '"value":0.2615}',
'{"x":"2022-03-29T00:00:00Z"', '"value":0.2573}',
'{"x":"2022-03-30T00:00:00Z"', '"value":0.272}',"""
import pandas as pd
import io
import json
#text = open(filename).read()
text = text.replace("', '", ",")
#print(text)
# read from string
df = pd.read_csv(io.StringIO(text), names=['data', 'other'], quotechar="'")
# convert string to dictionary
df['data'] = df['data'].apply(json.loads)
# split dictionary in separated columns
df[['x','value']] = df['data'].apply(pd.Series)
# remove some columns
del df['data']
del df['other']
print(df)
Result:
x value
0 2022-03-28T00:00:00Z 0.2615
1 2022-03-29T00:00:00Z 0.2573
2 2022-03-30T00:00:00Z 0.2720
You can also write part of it in one line:
df[['x','value']] = df['data'].apply(lambda item: pd.Series(json.loads(item)))
or split it separately (using .str[key] on the dictionary):
df['data'] = df['data'].apply(json.loads)
df['x'] = df['data'].str['x']
df['value'] = df['data'].str['value']
BTW:
you may also need to convert x from string to datetime
df['x'] = pd.to_datetime(df['x'])
Related
First time posting here and new to Python. My program should take a json file and convert it to csv. I have to check each field for validity. For records that do not have all valid fields, I need to output those records to a file. My question is, how would I take an invalid data entry and save it to a text file? Currently, the program can check for validity, but I do not know how to extract the data that is invalid.
import numpy as np
import pandas as pd
import logging
import re as regex
from validate_email import validate_email
# Variables for characters
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
# Read in json file to dataframe df variable
# Read in data as a string
df = pd.read_json('j2.json', dtype=str)  # dtype=str keeps every field as text
# Find nan values and replace it with string
#df = df.replace(np.nan, 'Error.log', regex=True)
# Data validation check for columns
df['accountValid'] = df['account'].str.contains(nameRegex, regex=True)
df['userNameValid'] = df['userName'].str.contains(nameRegex, regex=True)
df['valid_email'] = df['email'].apply(lambda x: validate_email(x))
df['valid_number'] = df['phone'].apply(lambda x: len(str(x)) == 11)
# Prepend 86 to phone number column
df['phone'] = ('86' + df['phone'])
# Convert dataframe to csv file
df.to_csv('test.csv', index=False)
The json file I am using has thousands of rows
Thank you in advance!
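Since the question is how to pull out the records that fail the checks, here is a minimal sketch of one way to do it, assuming the boolean flag columns created above (accountValid, userNameValid, valid_email, valid_number) and made-up output file names:
# combine the validity flags: a row is valid only if every check passed
flag_cols = ['accountValid', 'userNameValid', 'valid_email', 'valid_number']
all_valid = df[flag_cols].all(axis=1)
# split the frame into valid and invalid records
invalid_rows = df[~all_valid]
valid_rows = df[all_valid]
# write the invalid records to a text file for inspection (file name is an assumption)
invalid_rows.to_csv('invalid_records.txt', index=False)
# and keep only the valid records in the final csv
valid_rows.to_csv('test.csv', index=False)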
I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list without first saving the list to a csv and reading the csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    [f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
and if you want the first row to be used as the header, then:
df.columns = df.iloc[0]
df = df[1:]
It seems simpler to convert it to a nested list like in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or socket/network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from a network (socket, web API) as a single string or bytes.
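For completeness, a small sketch of the io.BytesIO variant mentioned above, assuming the same tab-separated data arrives as raw bytes:
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1", "15\tInfo4\t674.1"]
# pretend the data arrived over the network as raw bytes
raw_bytes = "\n".join(test).encode("utf-8")
# io.BytesIO wraps the bytes in a file-like object that read_csv can consume
df = pd.read_csv(io.BytesIO(raw_bytes), sep='\t')
print(df)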
I am working on a requirement to write my JSON output as [{"x": "MaxTemp", "y": "Temp3pm"}], and my current output looks like [MaxTemp, Temp3pm]. So the logic here is, as per the screenshot, the first word is X_axis and the second word after the comma (,) is y_axis. Below is my code, and I have attached a screenshot of the input data.
x_y_data = list(selected_ri['index'])
x_y_data
ini_string = {'Imp_features_selected_x_y':x_y_data}
# printing initial json
ini_string = json.dumps(ini_string)
# converting string to json
final_dictionary = json.loads(ini_string)
You could use str.split to split the text by ',' and expand it into two columns, for example:
df = df['index'].str.split(',', expand=True)
# then rename column name to x and y
df.columns = ['x', 'y']
Then you can convert it into a list of dicts and finally output it as JSON:
data = df.to_dict('records')
ini_string = json.dumps(data)
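Putting it together, a minimal sketch; the column name 'index' comes from the code above, and the single sample row is taken from the desired output, so treat the exact data as an assumption:
import json
import pandas as pd
# assumed one-row sample of the 'index' column from the screenshot
selected_ri = pd.DataFrame({'index': ['MaxTemp,Temp3pm']})
# split "MaxTemp,Temp3pm" into two columns and name them x and y
df = selected_ri['index'].str.split(',', expand=True)
df.columns = ['x', 'y']
# list of dicts -> JSON string
ini_string = json.dumps(df.to_dict('records'))
print(ini_string)   # [{"x": "MaxTemp", "y": "Temp3pm"}]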
I have a text file formatted like:
item(1) description="Tofu" Group="Foods" Quantity=5
item(2) description="Apples" Group="Foods" Quantity=10
What's the best way to read this style of format in Python?
Here's one way you could do this in pandas to get a DataFrame of your items.
(I copy-pasted your text file into "test.txt" for testing purposes.)
This method automatically assigns column names and sets the item(...) column as the index. You could also assign the column names manually, which would change the script a bit.
import pandas as pd
# read in the data
df = pd.read_csv("test.txt", delimiter=" ", header=None)
# set the index as the first column
df = df.set_index(0)
# capture our column names, to rename columns
column_names = []
# for each column...
for col in df.columns:
    # extract the column name (the part before "=")
    col_name = df[col].str.split("=").str[0].unique()[0]
    column_names.append(col_name)
    # extract the data (the part after "=")
    col_data = df[col].str.split("=").str[1]
    # optional: remove the double quotes around the values
    col_data = col_data.str.replace('"', "", regex=False)
    # store just the data back in the column
    df[col] = col_data
# store our new column names
df.columns = column_names
There are probably a lot of ways to do this based on what you're trying to accomplish and how much variation you expect in the data.
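For example, if every line really is an item(...) tag followed by key="value" pairs with no spaces inside the quotes, a plain-Python sketch (the file name "test.txt" is an assumption) could look like this:
import pandas as pd
records = []
with open("test.txt") as f:
    for line in f:
        if not line.strip():
            continue
        # "item(1)" first, then the key="value" chunks
        item, *pairs = line.split()
        row = {"item": item}
        for pair in pairs:
            key, value = pair.split("=", 1)
            row[key] = value.strip('"')   # drop the surrounding double quotes
        records.append(row)
df = pd.DataFrame(records).set_index("item")
Note that splitting on whitespace breaks if a quoted value contains spaces, and Quantity stays a string unless you convert it, e.g. with pd.to_numeric.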
Convert a Pandas Dataframe to a text string with comma delimiters and multiple rows
df = df.to_string()
email.send(text=df)
df columns = No Client_Name Warehouse_Area Location OEM
Expected result = No,Client_Name,Warehouse_Area,Location,OEM
You can use this to convert the dataframe object to a tab-separated string:
df_string = dataframe.to_csv(index=False, header=False, sep='\t')
It's quite simple, just use to_csv to output a string:
df.to_csv(index=False)
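A quick sketch of how that plugs into the snippet from the question (email.send and its text parameter come from the question, so the exact interface is assumed):
# to_csv with no path returns the csv as one string, e.g.
# "No,Client_Name,Warehouse_Area,Location,OEM\n..." with one line per row
csv_text = df.to_csv(index=False)
email.send(text=csv_text)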
You can do something like this:
df_string = df.to_string(header=False, index=False, index_names=False).split('\n')
vals = [','.join(x.split()) + '\n' for x in df_string]
email.send(text=vals)