How to return a specific data structure with inner dictionary of lists - python

I have a CSV file (image attached) and I need to take the CSV file and create a dictionary of lists with the format "{method},{number},{orbital_period},{mass},{distance},{year}".
So far I have this code:
import csv

with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)

print(inputm)
but my output comes out as flat rows like ['Radial Velocity', '1', '269.3', '7.1', '77.4', '2006']
when I want it to look like:
"Radial Velocity" : {"number":[1,1,1], "orbital_period":[269.3, 874.774, 763.0], "mass":[7.1, 2.21, 2.6], "distance":[77.4, 56.95, 19.84], "year":[2006.0, 2008.0, 2011.0] } , "Transit" : {"number":[1,1,1], "orbital_period":[1.5089557, 1.7429935, 4.2568], "mass":[], "distance":[200.0, 680.0], "year":[2008.0, 2008.0, 2008.0] }
Any ideas on how I can alter my code?

Hey SKR01, welcome to Stack Overflow!
I would suggest working with the pandas library. It is meant for table-like content such as yours. What you are looking for is a groupby on your #method column.
import pandas as pd

def remove_index(row):
    d = row._asdict()
    del d["Index"]
    return d

df = pd.read_csv("https://docs.google.com/uc?export=download&id=1PnQzoefx-IiB3D5BKVOrcawoVFLIPVXQ")
{row.Index: remove_index(row) for row in df.groupby('#method').aggregate(list).itertuples()}
The only thing that remains is removing the NaN values from the resulting dict.
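If you need that cleanup step, here is a minimal sketch, assuming the dict built above is assigned to a variable (called result here for illustration) and that missing values show up as float NaN:

import math

result = {row.Index: remove_index(row)
          for row in df.groupby('#method').aggregate(list).itertuples()}

# Drop NaN entries from every inner list.
cleaned = {method: {col: [v for v in values
                          if not (isinstance(v, float) and math.isnan(v))]
                    for col, values in cols.items()}
           for method, cols in result.items()}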

If you don't want to use Pandas, maybe something like this is what you're looking for:
import csv

with open('exoplanets.csv') as inputfile:
    reader = csv.reader(inputfile)
    inputm = list(reader)

header = inputm.pop(0)
del header[0]  # probably you don't want "#method" as a field

# create and populate the final dictionary
data = {}
for row in inputm:
    if row[0] not in data:
        data[row[0]] = {h: [] for h in header}
    for i, h in enumerate(header):
        data[row[0]][h].append(row[i+1])
print(data)
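One caveat: csv.reader yields strings, while the desired output holds numbers. If you want numeric lists, a hedged tweak is to convert each value on append; the convert helper below is illustrative, not part of the original answer:

def convert(value):
    # Turn numeric strings into floats; leave anything else untouched.
    try:
        return float(value)
    except ValueError:
        return value

# ...then inside the loop above:
#     data[row[0]][h].append(convert(row[i+1]))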

This is a bit complex, and I'm questioning why you want the data this way, but this should get you the output format you want without requiring any external libraries like Pandas.
import csv

with open('exoplanets.csv') as input_file:
    rows = list(csv.DictReader(input_file))

# Create the data structure
methods = {d["#method"]: {} for d in rows}

# Get a list of fields, trimming off the method column
fields = list(rows[0])[1:]

# Fill in the data structure
for method in methods:
    methods[method] = {
        # Null-trimmed version of the comprehension:
        # f: [r[f] for r in rows if r["#method"] == method and r[f]]
        f: [r[f] for r in rows if r["#method"] == method]
        for f in fields
    }
Note: This could be one multi-tiered list/dict comprehension, but I've broken it apart for clarity.
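For the curious, the collapsed version might look like this (an untested sketch that follows the same logic as above):

methods = {
    m: {f: [r[f] for r in rows if r["#method"] == m]
        for f in fields}
    for m in {d["#method"] for d in rows}
}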

Related

How to read JSON data into a dataframe using pandas

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the JSON file and make a dataframe such that each key-value pair becomes a row, and I need the headers "Sentence" and "Label". I tried using lines=True, but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines=True)
What is the correct way to load such JSON data?
You can use:
import json
import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'Sentence', 0: 'Label'})
Easy way that I remember:
import pandas as pd
import json

with open("./data.json", "r") as f:
    data = json.load(f)

df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
    .T\
    .reset_index()\
    .rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty, but as you probably noticed, lines=True isn't sufficient on its own, so the above transposes the result so that you have:

(index)    0
Text1      4
Text2      1
TextN      123

Resetting the index then moves the old index over to a column named "index", and the final step renames the columns.

How to save the result of scraping in a CSV?

I'm scraping the website www.kayak.it from different links for an academic project.
I need to save the result of the web scraping in a CSV; my code follows.
I need to have all the scraping of the different links in a single CSV, but I am not able to do that with the code that I have. Thank you for helping!
from selenium import webdriver
import time

list_link = ['https://www.kayak.it/flights/MIL-PAL/2021-09-05/2021-09-06/?sort=bestflight__a&fs=stops=0', 'https://www.kayak.it/flights/MIL-ROM/2021-09-05/2021-09-06/?sort=bestflight__a&fs=stops=0']

driver = webdriver.Chrome(executable_path="path")  # create the driver once, outside the loop

for link in list_link:
    driver.maximize_window()
    driver.implicitly_wait(10)
    driver.get(link)
    time.sleep(5)
    flights = driver.find_elements_by_class_name("resultInner")
    flights_dict = dict()  # to add the dictionaries within a dictionary
    #flight_dict = []  # alternative: to add the dictionaries in a list
    i = 1
    print(len(flights))
    for flight in flights:
        driver.execute_script("arguments[0].scrollIntoView(true);", flight)
        driver.execute_script("window.scrollBy(0,-300)")
        flightdetails = {}
        frowdet = []
        details = flight.find_elements_by_xpath(".//div[@class='mainInfo']//li")
        for d in details:
            fd = ""
            sd = ""
            first = d.find_elements_by_xpath(".//div[@class='top']")
            for f in first:
                fd += f.get_attribute("innerText") + ' '
            second = d.find_elements_by_xpath(".//div[@class='bottom']//span")
            for s in second:
                sd += s.get_attribute("innerText")
            detstr = sd + ' - ' + fd
            frowdet.append(detstr)
        fprice = flight.find_element_by_xpath(".//span[@class='price-text']").get_attribute("innerText")[:2]
        flightdetails["Flights"] = frowdet
        flightdetails["Price"] = fprice
        #print(flightdetails)
        flights_dict[i] = flightdetails
        i += 1
        #flight_dict.append(flightdetails)  # append dictionaries to a list
    print(flights_dict)
driver.quit()
There are many ways to do this. I would suggest looking into the Pandas library.
If you just need the data directly in a CSV, use the csv library:
https://www.pythontutorial.net/python-basics/python-write-csv-file/
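For instance, a minimal csv.writer sketch; the column names and row values below are placeholders for illustration, not output from the question's scraper:

import csv

# Sketch: write a header row plus scraped rows with the stdlib csv module.
rows = [("Flight A - 10:00", "99"), ("Flight B - 12:30", "120")]  # placeholder data
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Flights", "Price"])
    writer.writerows(rows)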
I prefer to use pandas, which allows you to write the scraped data into a dataframe that you can then export to either a CSV or an Excel file with ease. Here's an example:
import pandas

dataframe = pandas.DataFrame({
    'Stock Name': names,
    'Dimensions': dimensions,
    'Area (m2)': areas,
    'Length (m)': lengths,
})
dataframe.to_excel("output.xlsx")
dataframe.to_csv("output.csv")
In the above code, I state the column names and then the Python lists I want in each column. So the column named Stock Name will contain every element of my list called names. Then I output it to both xlsx and csv.
Using a dataframe will format the data nicely for you. Pandas has tons of functionality that lets you join workbooks, join sheets, and so on for Excel, but it's probably easier for you to scrape everything into one list within Python and then write it to a CSV directly.
See here for Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
Python has a built-in module for such tasks named csv. As you have a list of dicts, you might use csv.DictWriter; consider the following simple example:
import csv

squares = [{'x': 1, 'square': 1}, {'x': 2, 'square': 4}, {'x': 3, 'square': 9}]
with open('squares.csv', 'w', newline='') as csvfile:
    fieldnames = ['x', 'square']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(squares)
This creates a file squares.csv with the following content:
x,square
1,1
2,4
3,9
Note that by default csv.DictWriter uses the excel dialect, so you should then be able to easily import the file into Excel if you wish.
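Applied to the question's data, one hedged sketch (assuming flights_dict maps an integer id to a dict with "Flights", a list of strings, and "Price") might be:

import csv

# Sketch: flatten the question's flights_dict into CSV rows.
# The 'id' column name and the ' | ' join are illustrative choices.
with open('flights.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'Flights', 'Price'])
    writer.writeheader()
    for key, details in flights_dict.items():
        writer.writerow({'id': key,
                         'Flights': ' | '.join(details['Flights']),
                         'Price': details['Price']})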
You can write your final results to a CSV this way:
import io

with io.open('output.csv', 'w', encoding="utf-8") as file:
    for item in flights_dict:
        #print(item)
        #print(flights_dict[item])
        file.write('%s,%s,%s,%s,%s,%s\n' % (
            item,
            list(flights_dict[item].keys())[0],
            list(flights_dict[item].values())[0][0],
            list(flights_dict[item].values())[0][1],
            list(flights_dict[item].keys())[1],
            list(flights_dict[item].values())[1]))

Creating dictionary from CSV file using lists

I have a CSV file which contains four columns and many rows, each representing different data, e.g.:

OID  DID  HODIS  BEAR
1    34   67     98

I have already opened and read the CSV file; however, I am unsure how I can make each column into a key. I believe the format I have used in the code is best for the task I am creating.
Please see my code below; sorry if the explanation is a bit confusing.
Note that the #Values in column 1 part is what I am stuck on; I am unsure how I can define each column.
for line in file_2:
    the_dict = {}
    OID = line.strip().split(',')
    DID = line.strip().split(',')
    HODIS = line.strip().split(',')
    BEAR = line.strip().split(',')
    the_dict['KeyOID'] = OID
    the_dict['KeyDID'] = DID
    the_dict['KeyHODIS'] = HODIS
    the_dict['KeyBEAR'] = BEAR
    dictionary_list.append(the_dict)
print(dictionary_list)
There is a great Python method for strings that will split them on a delimiter: .split(delim), where delim is the delimiter; it returns the pieces as a list.
Given the code in your screenshot, you can use the following to split on a comma, which I assume is your delimiter because you said your file is a CSV.
...
for line in file_contents_2:
    the_dict = {}
    values = line.strip().split(',')
    OID = values[0]
    DID = values[1]
    HODIS = values[2]
    BEAR = values[3]
...
Also, in case you ever need to split a string on whitespace, that is the default behaviour of .split() (used when no argument is provided).
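For instance:

# Default split (no argument) breaks on any run of whitespace:
print("1  34\t67 98".split())    # ['1', '34', '67', '98']
print("1,34,67,98".split(','))   # ['1', '34', '67', '98']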
I would write the whole thing like this:
lod = []
with open(file, 'r') as f:
    l = f.readlines()
    for i in l[1:]:
        lod.append(dict(zip(l[0].rstrip().split(), i.split())))
split doesn't need a parameter here; just use a simple for loop inside with open, and there is no need to know the keys in advance.
And if you care about empty dictionaries, do:
lod = list(filter(None, lod))
print(lod)
Output:
[{'OID': '1', 'DID': '34', 'HODIS': '67', 'BEAR': '98'}]
If you want integers:
lod = [{k: int(v) for k, v in i.items()} for i in lod]
print(lod)
Output:
[{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}]
Another way to do it is with a library like Pandas, which is powerful for working with tabular data. It is fast because we avoid explicit loops. In the example below you only need Pandas and the name of the CSV file; I used StringIO just to turn string data into a file-like object that mimics a CSV file.
import pandas as pd
from io import StringIO

data = StringIO('''OID,DID,HODIS,BEAR
1,34,67,98''')  # mimic a csv file
df = pd.read_csv(data, sep=',')
print(df.T.to_dict()[0])
In the end you need only a one-liner that chains the commands: read the CSV, transpose, and transform to a dictionary:
import pandas as pd

csv_dict = pd.read_csv('mycsv.csv', sep=',').T.to_dict()[0]

Lookup a csv file using python 3.x

I want to look up the CSV file below and return the value from the field called 'datatype', passing mapping, transformation and portname as lookup ports.

Mapping   transformation  portname  datatype
m_TEST_1  EXP_test_1      field_1   nstring
m_TEST_1  EXP_test_1      field_2   date/time
Basically, I want to perform (Select datatype from csv_file where mapping=? and transformation=? and portname=? )
Currently, I'm looping through each row of the CSV file to fetch the datatype. Is there an easier and better way to do it?
Below is the current code that I'm using.
import csv

lkp_file = csv.DictReader(open(lkpfile))
for row in lkp_file:
    if mapping.get('NAME') == row['Mapping']:
        if frominstance == row['transformation']:
            if fromfield == row['portname']:
                fromdatatype = row['datatype']
                break
The best approach you could take is a csv.DictReader and then some kind of transformation.
Is (Mapping, transformation, portname) unique?
If so, you can do something similar to this:
import csv

d = {}
with open("path-to.csv", "r") as f:
    for row in csv.DictReader(f, delimiter=","):
        d[(row['Mapping'], row['transformation'], row['portname'])] = row['datatype']
You will have to swap the delimiter, as in my example I use commas and you do not have them in the text you gave us.
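Once the dict is built, each lookup is a single access. A quick illustration using the sample values from the question:

# Lookup with the question's sample values:
fromdatatype = d[('m_TEST_1', 'EXP_test_1', 'field_2')]
print(fromdatatype)  # date/time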
Why not make use of Pandas?
Your csv file:
example.csv:
Mapping,transformation,portname,datatype
m_TEST_1,EXP_test_1,field_1,nstring
m_TEST_1,EXP_test_1,field_2,date/time
Code:
import pandas as pd
df = pd.read_csv('example.csv')
reqd_cols = df[(df.Mapping == 'm_TEST_1') & (df.transformation == 'EXP_test_1') & (df.portname == 'field_1')]
print(reqd_cols)
# Mapping transformation portname datatype
# 0 m_TEST_1 EXP_test_1 field_1 nstring
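To pull out just the datatype value from the filtered frame, one small addition (assuming the filter matched at least one row) could be:

# Take the datatype of the first matching row:
datatype = reqd_cols['datatype'].iloc[0]
print(datatype)  # nstring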

How to Perform Mathematical Operation on One Value of a CSV file?

I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation on it. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv

f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_row += float(i)  # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs to see the preferred approach for reading through a CSV file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv

with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        pass  # perform an operation per row here
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest is playing around with printing out your rows to see what your data looks like. You will see that each row comes out as a dictionary.
So as you go through each row, you can simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to sum everything in Colum1:
    s += float(row['Colum1'])
So all of that will look like this:
import csv

s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
You can do pretty much all of this with pandas
import pandas as pd

Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1', 'Colum2', 'Colum3'])
df = df[1:]  # remove the header row since it's unnecessary
print df
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5  # modify a single cell in place
print df
You can write back to your CSV using df.to_csv('file path', index=False, header=True). Having header=True will add the headers back in.
To do this more along the lines of what you already have, you can do it like this:
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))
data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row and cut out the column names; then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd

df = pd.read_csv('sample.csv')  # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum()  # replace the column by subtracting the column sum from each value
print df
df.to_csv('sample.csv', index=False)  # save the dataframe back to the csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd

df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum()  # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
