Using pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a CSV file like the one below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: dataframe from a csv file]
However, when I try to extract a value using a key (e.g. df["id"]), I get an error message.
I'd like to get the values 1234 and 5678 via df["id"]. What steps should I take to get this done? This may be a very basic question, but I'd appreciate your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter, but it hasn't done so in your case. See the read_csv documentation for more on this. Because of that, the pandas DataFrame has a single column, value, which holds each entire line from your file as an individual cell - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the DataFrame doesn't have a column id, and you can't select data by id.
The data aren't formatted as a typical pandas DataFrame, with a header row and columns of data. One option for reading in this data is to process each row manually, though there may be slicker options (one is sketched after the code below).
file = 'test.dat'
f = open(file, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
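For reference, one slicker option is to let read_csv handle the CSV quoting and then parse the JSON strings in the value column yourself. This is only a rough sketch; it assumes the file is named test.dat as above and a pandas version that provides pd.json_normalize (1.0 or newer).
import json
import pandas as pd

# Sketch: read the one-column CSV, then expand the JSON strings.
# read_csv un-escapes the doubled quotes, leaving a column named 'value'.
raw = pd.read_csv('test.dat')
parsed = pd.json_normalize(raw['value'].apply(json.loads).tolist())

print(parsed['id'])        # 1234, 5678
print(parsed['currency'])  # USD, EUR
With this, df["id"]-style selection works on the parsed frame.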
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read it as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save the cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to parse the value in each row of your dataframe using json.loads() (or eval()).
Something like this:
import json

# iterate over rows; row.value is the JSON string in the 'value' column
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR (eval works too, but is unsafe on untrusted input)
    print(eval(row.value)["id"])
I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line; a CSV file is not available.
I would like to have it as a DataFrame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be appreciated, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the Python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
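If the file really is too large to load at once, ijson (mentioned in the comment above) can stream it instead. This is only a rough sketch; it assumes the same structure as the test file, and the prefix strings (results.item.series.item...) are my reading of how that structure maps onto ijson's prefix syntax.
import ijson
import pandas as pd

# First pass: grab the (small) list of column names.
with open("test.json", "rb") as f:
    columns = next(ijson.items(f, "results.item.series.item.columns"))

# Second pass: stream the value rows and process them in bounded chunks.
rows = []
with open("test.json", "rb") as f:
    for row in ijson.items(f, "results.item.series.item.values.item"):
        rows.append(row)
        if len(rows) >= 100_000:
            chunk = pd.DataFrame(rows, columns=columns)
            # ... process the chunk here ...
            rows = []

if rows:
    print(pd.DataFrame(rows, columns=columns))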
I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script to read the csv and compute the frequencies of each WORD by group, so the output looks like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then do some other operations on the values. The problem is now I have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
import pandas as pd
from collections import Counter

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

outfile = open("csv_out.txt", "w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call. See here for details
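What read_csv returns when you pass chunksize is an iterator of DataFrames, so you loop over it rather than indexing into it. A minimal sketch of the pattern (file and column names taken from your code):
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames
reader = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1_000_000)

for chunk in reader:
    # each `chunk` is an ordinary DataFrame; do the per-chunk work here
    print(chunk.shape)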
Alternatively, you could use the Python csv library and create your own csv reader or DictReader, then use it to read the data in whatever chunk size you choose; a sketch follows.
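Here's a rough sketch of that csv-module route. It assumes the same file and column layout as in your code, and the counting logic is just an illustration of streaming the file row by row without holding the full table in memory.
import csv
from collections import defaultdict

counts = defaultdict(lambda: {"freqW1": 0, "freqW2": 0})

with open("csvfile.txt", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, fieldnames=["group", "val1", "val2", "text"])
    for row in reader:
        if "WORD1" in row["text"]:
            counts[row["group"]]["freqW1"] += 1
        if "WORD2" in row["text"]:
            counts[row["group"]]["freqW2"] += 1

for group, c in sorted(counts.items()):
    print(group, c["freqW1"], c["freqW2"])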
Okay, I misunderstood the chunksize parameter. I solved it by doing this:
frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)

for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)

outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()
I have an existing json file in the format of a list of dicts.
$cat output.json
[{'a':1, 'b':2}, {'a':2, 'b':3}]
And I have a DataFrame
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
I want to save "df" with to_json and append it to the file output.json:
df.to_json('output.json', orient='records')  # mode='a' not available for to_json
* There is an append mode='a' for to_csv, but not for to_json.
The expected generated output.json file will be:
[{'a':1, 'b':2}, {'a':2, 'b':3}, {'a':1, 'b':3}, {'a':2, 'b':4}]
The existing file output.json can be huge (say, terabytes). Is it possible to append the new dataframe result without loading the whole file?
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_json.html
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html
You could do this. It will write each record/row as JSON on a new line.
# `data` is assumed to be an iterable of DataFrame chunks
f = open(outfile_path, mode="a")
for chunk_df in data:
    f.write(chunk_df.to_json(orient="records", lines=True))
f.close()
No, you can't append to a json file without re-writing the whole file using pandas or the json module. You might be able to modify the file "manually" by opening it in read/write mode, seeking to the correct position, and inserting your data. I wouldn't recommend this, though. Better to just use a file format other than json if your file is going to be larger than your RAM.
This answer also might help. It doesn't create valid json files (instead each line is a json string), but its goal is very similar to yours.
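For completeness, here's a minimal sketch of the "seek and insert" idea mentioned above. It assumes output.json is a valid JSON array that ends with ] and has no trailing whitespace or newline; those assumptions are not checked.
import os
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new_records = df.to_json(orient='records')    # e.g. '[{"a":1,"b":3},{"a":2,"b":4}]'

with open('output.json', 'rb+') as f:
    f.seek(-1, os.SEEK_END)                   # position just before the final ']'
    f.truncate()                              # drop the ']'
    f.write(b',')                             # separator between old and new records
    f.write(new_records[1:].encode())         # strip the leading '[', keep the trailing ']'
After this, the file is still a single JSON array, so pd.read_json('output.json') sees the old and new records together.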
Maybe you need to think in terms of orient='records':
def to_json_append(df, file):
    '''
    Load the file with
    pd.read_json(file, orient='records', lines=True)
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    f = open('tmp.json', 'r')
    k = f.read()
    f.close()
    f = open(file, 'a')
    f.write('\n')  # Prepare next data entry
    f.write(k)
    f.close()
df = pd.read_json('output.json')
# Save again as lines
df.to_json('output.json', orient='records', lines=True)

# new data
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})

# append:
to_json_append(df, 'output.json')
To load the full data:
pd.read_json('output.json',orient='records',lines=True)
I've solved it just by using built-in pandas.DataFrame methods. You need to keep performance in mind in the case of huge dataframes (there are ways to deal with that).
Code:
import os
import pandas as pd

# dir_to_json_file and df_to_append are assumed to be defined already
if os.path.isfile(dir_to_json_file):
    # if it exists, open and read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case of adding too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)
    # save it to the json file in AppData.bin
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')
Use case: write a large amount of data to a JSON file with little memory.
Let's say we have 1,000 dataframes, and each dataframe is about 1,000,000 lines of JSON. If each dataframe needs 100 MB, the total file size would be 1000 * 100 MB = 100 GB.
Solution:
use a buffer to store the content of each dataframe
use pandas to dump it to text
use append mode to write the text to the end of the file
import io
import pandas as pd
from pathlib_mate import Path

n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice#example.com"
f = Path(__file__).change(new_basename="big-json-file.json")

if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
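To read the resulting file back without loading it all at once, pandas can also stream JSON Lines in chunks. A small sketch, assuming the file name above and that the script runs from the same directory:
import pandas as pd

# lines=True is required for chunked JSON reading
for chunk in pd.read_json("big-json-file.json", lines=True, chunksize=10):
    print(chunk.shape)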