Converting a string representation of dicts to an actual dict

Converting a string representation of dicts to an actual dict - python

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
print(parsed)
pd.DataFrame(???)

You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
parsed = ast.literal_eval(string.rstrip())
for key, value in parsed.items():
parsed[key] = [value]
df.append(pd.DataFrame.from_dict(parsed))

parsed is a dictionary, you make a dataframe from it, then join all the frames together:
df = []
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
if type(parsed) != dict:
continue
subDF = pd.DataFrame(parsed, index=[0])
df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a few one, like foo4 on the second row.

Related

How to read a json data into a dataframe using pandas

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe such as
Each key value pairs will be a row in the dataframe and I need to need headers "Sentence" and "Label". I tried with using lines = True but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?

you can use:
with open('json_example.json') as json_data:
data = json.load(json_data)
df=pd.DataFrame.from_dict(data,orient='index').reset_index().rename(columns={'index':'Sentence',0:'Label'})

Easy way that I remember
import pandas as pd
import json
with open("./data.json", "r") as f:
data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
.T\
.reset_index()\
.rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty but as you probably noticed, lines=True isn't completely sufficient so the above transposes the result so that you have
(index)
0
Text1
4
Text2
1
TextN
123
So then resetting the index moves the index over to be a column named "index" and then renaming the columns.

How can I see a list of the variables in a CSV column?

I have a csv file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to have a code to list the variables in a specific column.
For example, I'd like it to return {Kish, Qeshm, Tabriz} for the 'city' column.

You need to first to import the csv module into your python file and read over each row in the file and save it in a list, so it'll be like
import csv
cities = []
with open("yourfile.csv", "r") as file:
reader = csv.DictReader(file) //This will save the values in the very top of the csv file as header so it will skip a line
for row in reader:
city = row["City"]
cities.append(city)
this will give you a list of cities=[Kish, Qesh, Tabriz, ....]

It appears you want to remove duplicates as well, which you can have by simply cast the finished list to set. Here's how to do it with pandas:
import pandas as pd
cities = pd.read_csv('yourfile.csv', usecols=['City'])['City']
# just cast to list if you want a plain list instead of a DataFrame
cities_list = list(cities)
# use set to remove the duplicates
unique_cities = set(cities)
In case you have need to preserve ordering, you might use an ordered dict with just keys.
Also, in case you're short on memory trying to read 5M rows in one go, you can read them in chuncks:
import pandas as pd
cities_chunks_list = [chunck['City'] for chunck in pd.read_csv('yourfile.csv', usecols=['City'], chunksize = 1000)]
#let's flatten the list
cities_list = [city for cities_chunk in cities_chunks_list for city in cities_chunk]
Hope I helped.

Use string literal instead of header name in Pandas csv file manipulation

Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now let's say the type column doesn't even exist in the csv file, and I want to pass a literal string as a type value instead of the values from a column in the csv file. So for example, I want to create a nested object called org using the values from the data column as before, but I want to just use the string foo for all values of a key called type. How to accomplish this?

You could just build it by hand:
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x,
'type': 'foo'})

use Chainmap. This will allow to use multiple columns (columns_to_use), and even override existing ones (if type is in these columns, it will be overridden):
from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
BTW, without adding constant values it could be done by df.to_dict():
csv_file['org'] = csv_file[['location', 'type']].to_dict('records')

Removing Unicode from Pandas Column Text

I am attempting to determine if the data inside a list is within a dataframe column. I am new to Pandas and have been struggling with this, so at the moment I am turning the dataframe column of interest into a list. However, when I df.tolist() the list contains a slew of unicode around the string. As i am attempting to compare this with text from the other list which is not in unicode I am running into issues.
I am attempted to turn the other list into unicode but then the list had items such that read like u'["item"]' which didn't help. I have also tried to remove the unicode from the dataframe but only get errors. I cannot iterate as pandas tells me that the dataframe is to long to iterate over. Below is my code:
SDC_wb = pd.ExcelFile('C:\ BLeh')
df = SDC_wb.parse(SDC_wb.sheet_names[1], header = 1)
def Follower_count(filename):
filename = open(filename)
reader = csv.reader(filename)
handles = df['things'].tolist()
print handles
dict1 = {}
for item in reader:
if item in handles:
user = api.get_user(item)
dict1[item] = user.Follower_count
newdf = pd.DataFrame(dict1)
newdf.to_csv('test1.csv', encoding='utf-8')
Here is what the list from the dataframe looks like:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
Here is what x = [unicode(s) for s in some_list] looks like:
u"['#HomeGoods']", u"['#pier1']", u"['#houzz']", u"['#InteriorDesign']", u"['#zulily']"]
Naturally these don't align to check the "in" requirement. Thus, I need a method of converting the .tolist() object from:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
to:
[#Mastercard, #Visa, #AmericanExpress, #CapitalOne]
so that the for item in handles function will see similar handles.
Thanks for your help.

save a list of different Dataframes to json

I have different pandas dataframes, which I put in a list.
I want to save this list in json (or any other format) which can be read by R.
import pandas as pd
def create_df_predictions(extra_periods):
"""
make a empty df for predictions
params: extra_periods = how many prediction in the future the user wants
"""
df = pd.DataFrame({ 'model': ['a'], 'name_id': ['a'] })
for col in range(1, extra_periods+1):
name_col = 'forecast' + str(col)
df[name_col] = 0
return df
df1 = create_df_predictions(9)
df2 = create_df_predictions(12)
list_df = [df1, df2]
The question is how to save list_df in a readable format for R? Note that df1 and df2 are have a different amount of columns!

don't know panda DataFrames in detail, so maybe this won't work. But in case it is kind of a traditional dict, you should be able to use the json module.
df1 = create_df_predictions(9)
df2 = create_df_predictions(12)
list_df = [df1, df2]
You can write it to a file, using json.dumps(list_df), which will convert your list of dicts to a valid json representation.
import json
with open("my_file", 'w') as outfile:
outfile.write(json.dumps(list_df))
Edit: as commented by DaveR dataframes are't serializiable. You can convert them to a dict and then dump the list to json.
import json
with open("my_file", 'w') as outfile:
outfile.write(json.dumps([df.to_dict() for df in list_df]))
Alternatively pd.DataFrame and pd.Series have a to_json() method, maybe have a look at those as well.

To export the list of DataFrames to a single json file, you should convert the list into a DataFrame and then use the to_json() function as shown below:
df_to_export = pd.DataFrame(list_df)
json_output = df_to_export.to_json()
with open("output.txt", 'w') as outfile:
outfile.write(json_output)
This will export the full dataset to a single json string and export it to a file.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Converting a string representation of dicts to an actual dict - python

Related

How to read a json data into a dataframe using pandas

How can I see a list of the variables in a CSV column?

Use string literal instead of header name in Pandas csv file manipulation

Removing Unicode from Pandas Column Text

save a list of different Dataframes to json

Categories

Resources