Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now suppose the type column doesn't exist in the CSV file at all, and I want to use a literal string as the type value instead of values from a column. For example, I want to create a nested object called org using the values from the location column as before, but use the string foo for every value of a key called type. How can I accomplish this?
You could just build it by hand:
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x,
                                                        'type': 'foo'})
Alternatively, use ChainMap. This allows you to use multiple columns (columns_to_use) and even override existing ones (if type is among those columns, it will be overridden):
from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
    lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
BTW, without the added constant value this could be done with df.to_dict():
csv_file['org'] = csv_file[['location', 'type']].to_dict('records')
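For reference, here is a minimal self-contained sketch of the hand-built approach, using a small made-up frame in place of the real CSV (the location values are invented for illustration):
import pandas as pd

csv_file = pd.DataFrame({'location': ['NYC', 'LA'], 'other': [1, 2]})
# Build the nested dict per row, hard-coding the literal 'foo' for 'type'
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x, 'type': 'foo'})
print(csv_file['org'].tolist())
# [{'location': 'NYC', 'type': 'foo'}, {'location': 'LA', 'type': 'foo'}]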
Related
I have a nested JSON dict that I need to convert to spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict present in dataframe column using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:
{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}
I need to extract the nested dict value. I used below code to clean the data and read it into a dataframe
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context",StringType(),True)]) #Schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value",regexp_replace(raw_df.value,'/','')).select('cleansed_value') #Removed extra '/' in the data
cleansed_df.select(from_json('cleansed_value',schema=schema1)).show(1, truncate=False)
I get a null dataframe each time I run the above code. Please help.
I tried the approach below and it didn't work:
PySpark: Read nested JSON from a String Type Column and create columns
I also tried writing the data to a JSON file and reading it back, following the post below; that didn't work either:
reading a nested JSON file in pyspark
The null chars \u0000 affect the parsing of the JSON. You can replace them as well:
from pyspark.sql import functions as F

df = spark.read.json('path')
df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')
)
df2.show(20, 0)
+-------------------+------------------+------+
|value              |cleansed_value    |parsed|
+-------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-------------------+------------------+------+
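Once the JSON string is parsed into a struct column, the nested field can be pulled out with ordinary nested-field selection. A small follow-up sketch, reusing the df2 frame from above:
# Extract the nested 'context' field from the parsed struct into its own column
df3 = df2.withColumn('context', F.col('parsed.context'))
df3.select('context').show()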
import pandas as pd
xl=pd.ExcelFile('/Users/denniz/Desktop/WORKINGPAPER/FDIPOLITICS/python.xlsx')
dfs = pd.read_excel(xl,sheet_name=None, dtype={'COUNTRY':str,'YEAR': int, 'govtcon':float, 'trans':float},na_values = "Missing")
dfs.head()
After running the code above I got the following error:
'collections.OrderedDict' object has no attribute 'head'
sheet_name=None will not work here; you can open and read the Excel file in a single call like this:
import pandas as pd
import xlrd
dfs=pd.read_excel('/Users/denniz/Desktop/WORKINGPAPER/FDIPOLITICS/python.xlsx',sheet_name=0, dtype={'COUNTRY':str,'YEAR': int, 'govtcon':float, 'trans':float},na_values = "Missing")
dfs.head()
I have read the API reference of pandas.read_excel: the method returns either a DataFrame or a dict of DataFrames.
Since you set sheet_name=None, all sheets are returned as a dict of DataFrames, keyed by sheet name.
So in your code snippet, dfs is a dict, not a DataFrame, and a dict obviously has no head method. Your code should look like dfs[sheet_name].head().
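For illustration, a small sketch of working with the dict that sheet_name=None returns (the sheet name 'Sheet1' is hypothetical):
import pandas as pd

# sheet_name=None returns {sheet_name: DataFrame, ...}
dfs = pd.read_excel('/Users/denniz/Desktop/WORKINGPAPER/FDIPOLITICS/python.xlsx', sheet_name=None)

for sheet_name, frame in dfs.items():
    print(sheet_name)
    print(frame.head())

# or access a single sheet by its name:
# dfs['Sheet1'].head()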
I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here (Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast

with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)

pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
    parsed[key] = [value]

df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        for key, value in parsed.items():
            parsed[key] = [value]
        # append returns a new frame rather than modifying df in place, so reassign it
        df = df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary; make a dataframe from each one, then join all the frames together:
df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if type(parsed) != dict:
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
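To make the column handling concrete, here is a small in-memory sketch of what the concat approach produces for the two sample rows shown in the question:
import pandas as pd

rows = [
    {'foo': 'bar', 'foo1': 'bar1', 'foo3': 'bar3'},
    {'foo': 'bar', 'foo1': 'bar1', 'foo4': 'bar4'},
]
frames = [pd.DataFrame(r, index=[0]) for r in rows]
df = pd.concat(frames, ignore_index=True, sort=False)
print(df)
#    foo  foo1  foo3  foo4
# 0  bar  bar1  bar3   NaN
# 1  bar  bar1   NaN  bar4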
I have read a Matlab file containing a large amount of arrays as a dataset into Python storing the Matlab Dictionary under the variable name mat using the command:
mat = loadmat('Sample Matlab Extract.mat')
Is there a way I can then use Python's write to csv functionality to save this Matlab dictionary variable I read into Python as a comma separated file?
import csv

with open('mycsvfile.csv','wb') as f:
    w = csv.writer(f)
    w.writerows(mat.items())
creates a CSV file with one column containing array names within the dictionary and then another column containing the first element of each corresponding array. Is there a way to utilize a command similar to this to obtain all corresponding elements within the arrays inside of the 'mat' dictionary variable?
The function scipy.io.loadmat generates a dictionary looking something like this:
{'__globals__': [],
'__header__': 'MATLAB 5.0 MAT-file, Platform: MACI, Created on: Wed Sep 24 16:11:51 2014',
'__version__': '1.0',
'a': array([[1, 2, 3]], dtype=uint8),
'b': array([[4, 5, 6]], dtype=uint8)}
It sounds like what you want to do is make a .csv file with the keys "a", "b", etc. as the column names and their corresponding arrays as data associated with each column. If so, I would recommend using pandas to make a nicely formatted dataset that can be exported to a .csv file. First, you need to clean out the commentary members of your dictionary (all the keys beginning with "__"). Then, you want to turn each item value in your dictionary into a pandas.Series object. The dictionary can then be turned into a pandas.DataFrame object, which can also be saved as a .csv file. Your code would look like this:
import scipy.io
import pandas as pd
mat = scipy.io.loadmat('matex.mat')
mat = {k:v for k, v in mat.items() if k[0] != '_'}
data = pd.DataFrame({k: pd.Series(v[0]) for k, v in mat.items()}) # compatible for both python 2.x and python 3.x
data.to_csv("example.csv")
This is a correct solution for converting any .mat file into a .csv file. Try it:
import scipy.io
import numpy as np

data = scipy.io.loadmat("file.mat")
for i in data:
    if '__' not in i and 'readme' not in i:
        # write each variable to its own CSV so later keys don't overwrite earlier ones
        np.savetxt(i + ".csv", data[i], delimiter=',')
import scipy.io
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

class MatDataToCSV():

    def __init__(self):
        pass

    def convert_mat_tocsv(self):
        mat = scipy.io.loadmat('wiki.mat')
        instances = mat['wiki'][0][0][0].shape[1]
        columns = ["dob", "photo_taken", "full_path", "gender",
                   "name", "face_location", "face_score", "second_face_score"]
        df = pd.DataFrame(index=range(0, instances), columns=columns)
        for i in mat:
            if i == "wiki":
                current_array = mat[i][0][0]
                for j in range(len(current_array)):
                    df[columns[j]] = pd.DataFrame(current_array[j][0])
        return df
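A hypothetical usage sketch (it assumes wiki.mat is in the working directory, as the class above expects):
converter = MatDataToCSV()
df = converter.convert_mat_tocsv()
df.to_csv('wiki.csv', index=False)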
Reading a mat file (.mat) with the code below
data = scipy.io.loadmat(files[0])
gives a dictionary of keys and values. '__header__', '__version__' and '__globals__' are some of the default keys, which we need to remove.
cols = []
for i in data:
    if '__' not in i:
        cols.append(i)

temp_df = pd.DataFrame(columns=cols)
for i in data:
    if '__' not in i:
        temp_df[i] = (data[i]).ravel()
We remove the unwanted header keys using if '__' not in i:, then make a dataframe from the remaining keys, and finally assign each array (flattened with ravel) to its respective column header.
I have successfully finished data manipulation using pandas (in Python). Depending on my starting data set, I end up with a series of data frames, say for example sampleA, sampleB, and sampleC.
I want to automate saving these datasets (there can be a lot of them), each with a unique identifier in the file name.
So I create a tuple of DataFrames and use a loop to save the data, but I cannot make the loop give a unique name each time. See for example:
import numpy as np
import pandas as pd
sampleA= pd.DataFrame(np.random.randn(10, 4))
sampleB= pd.DataFrame(np.random.randn(10, 4))
sampleC= pd.DataFrame(np.random.randn(10, 4))
allsamples=(sampleA, sampleB, sampleC)
for x in allsamples:
    #name = allsamples[x]
    #x.to_csv(name + '.dat', sep=',', header = False, index = False)
    x.to_csv(x + '.dat', sep=',', header = False, index = False)
When I use the above (without the commented lines), everything is saved as x.dat and I keep only the latest dataset; if I use the name lines instead, I get errors.
Any idea how I can come up with a naming approach so I can save 3 files named sampleA.dat, sampleB.dat, and sampleC.dat?
If you use strings, then you can look up the variable of the same name using vars():
allsamples = ('sampleA', 'sampleB', 'sampleC')

for name in allsamples:
    df = vars()[name]
    df.to_csv(name + '.dat', sep=',', header=False, index=False)
Without an argument vars() is equivalent to locals(). It returns a "read-only" dict mapping local variable names to their associated values. (The dict is "read-only" in the sense that it is mainly useful for looking up the value of local variables. Like any dict, it is modifiable, but modifying the dict will not modify the variable.)
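A tiny sketch of the name lookup idea on its own, independent of the CSV writing:
greeting = 'hello'
# vars() maps in-scope variable names to their values,
# so a string key recovers the object bound to that name
print(vars()['greeting'])  # hello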
Be aware that Python tuple items have no names. Moreover, allsamples[x] is meaningless: you are indexing a tuple with a DataFrame, so what do you expect to get?
One can use a dictionary instead of a tuple to name and store the variables simultaneously:
all_samples = {'sampleA':sampleA, 'sampleB':sampleB, 'sampleC':sampleC}
for name, df in all_samples.items():
    df.to_csv('{}.dat'.format(name), sep=',', header = False, index = False)