Iterate over python dict based on a csv column which is grouped - python

I have attached a CSV file. I have written a Python script which reads the CSV file, iterates over a data frame, processes the contents, and inserts them into MongoDB.
Right now, all of the data is getting inserted into the DB.
Is there a way to iterate over the Python dict and only take the first 10 ranks of data (the group rank column)? This column is grouped, as you can see in the attached img.
file = request.files['file']
client = pymongo.MongoClient("mongodb://localhost:27017")
df = pd.read_csv(file)
final_dict = {}
# Build one document per cluster, each holding the list of its rows
for row in df.iterrows():
    cluster_name = row[1][1]
    print(cluster_name)
    if cluster_name not in final_dict:
        final_dict[cluster_name] = {"queries": []}
    final_dict[cluster_name]["queries"].append(
        {"cluster_name": row[1][0], "cluster_rank": row[1][1],
         "cluster_size": row[1][2]})
db = client["db_name"]
for key in final_dict:
    db.testing.insert_one(final_dict[key])

To keep only the rows whose group rank is less than or equal to 10, you can use .loc:
df = df.loc[df['group rank'] <= 10]
df
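A minimal sketch of how that filter could slot into the loop from the question, assuming the rank column in the CSV is literally named 'group rank' as in the snippet above (adjust to the real header):

df = pd.read_csv(file)
# Ranks restart at 1 within each group, so a global filter on the rank
# column keeps the first 10 ranks of every group.
df = df.loc[df['group rank'] <= 10]
final_dict = {}
for row in df.iterrows():  # same loop as before, now over the filtered rows
    ...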

Related

Automatic transposing Excel user data in a Pandas Dataframe

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it (and I am assuming there are likely to be more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. That list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # header line for a person: col2 holds "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # drop the closing bracket
        p_index = counter
        counter += 1
    else:
        # data line for the current person: col1 is a date, col2 an amount
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
OUTPUT:

Split and store dataframe but name based on unique values in specific cols

I have a dataframe like as below
data = pd.DataFrame({'email_id': ['abc#gmail.com;test1#gmail.com', 'abc#gmail.com;def#yahoo.com',
                                  'abdc#gmail.com', 'ache#gmail.com', 'aqce#gmail.com',
                                  'pqr#gmail.com', 'pqr#gmail.com'],
                     'Dept_id': [21, 23, 25, 26, 28, 29, 31],
                     'dept_name': ['Science', 'Chemistry', 'Maths', 'Social', 'Physics', 'Botany', 'Zoology'],
                     'location': ['KOR', 'ASN', 'ANZ', 'IND', 'AUS', 'NZ', 'NZ']})
I would like to do the below
a) Split the dataframe based on the unique email_id values in the email_id column
b) Store each split in its own Excel file (one file per unique email_id)
c) Name each Excel file after the corresponding unique values present in the dept_id, dept_name, and location columns
I tried the below
for k, v in data.groupby(['email_id']):
    dept_unique_ids = v['Dept_id'].unique()
    dept_unique_names = v['dept_name'].unique()
    location_unique = v['location'].unique()
    writer = pd.ExcelWriter(f"{k}.xlsx", engine='xlsxwriter')
    v.to_excel(writer, columns=col_list, sheet_name=f'{k}', index=False, startrow=1)
The above code splits the data successfully, but it names each file after the email_id used as the group key. Instead, I want to name the file after the dept_id, dept_name, and location values for that specific key.
For example, take email_id = pqr#gmail.com: it has two unique dept_ids (29 and 31), its unique dept_names are Botany and Zoology, and its unique location is NZ.
So, I want the file name to be 29_31_Botany_Zoology_NZ.xlsx.
Therefore, I expect my output files (one per unique email_id) to have filenames like below.
IIUC, you can use:
for k, v in data.groupby(['email_id']):
    dept_unique_ids = '_'.join(v['Dept_id'].astype(str).unique())
    dept_unique_names = '_'.join(v['dept_name'].unique())
    location_unique = '_'.join(v['location'].unique())
    filename = '_'.join([dept_unique_ids, dept_unique_names, location_unique])
    print(filename)
    # Use a context manager so the file is saved and closed automatically
    with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
        v.to_excel(writer, columns=col_list, sheet_name=f'{k}', index=False, startrow=1)
Output:
23_Chemistry_ASN
21_Science_KOR
25_Maths_ANZ
26_Social_IND
28_Physics_AUS
29_31_Botany_Zoology_NZ

How to group csv in python without using pandas

I have a CSV file with 3 columns, "Username", "Date", and "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022, Total Energy saved = 1001; Date: 24/2/2022, Total Energy saved = 700)
I am a beginner at Python. Typically, I would use pandas to solve this, but it is not allowed for this project, so I am at a complete loss as to where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to pandas for reading CSV files is native Python's csv module. You read the file and extract just the values that you need. Here I filter on the first column and keep only the matching rows' values from the column of interest (the third one, index 2).
import csv

energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0] == "merrytan":
            energy_saved.append(row[2])

energy_saved = sum(map(int, energy_saved))
Now you have a list of just the values concerned, and you can sum them afterwards.
Edit - So, I just realized that I left out the date part of your request completely lol. Here's the update.
import csv

my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need the date column of the file as well. We want to present two "columns" of output, but with pandas prohibited, we turn to a dictionary with dates as keys and energy as values.
Your date column has repeated values (whether intended or not), and dictionary keys must be unique, so we use a loop: add each date as a key with its energy as the value, and when the key is already present, add to the existing value instead.
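The same accumulation can be written with collections.defaultdict, which spares the .get() call with a default; this is just an equivalent sketch of the pattern above:

import csv
from collections import defaultdict

totals = defaultdict(int)  # unseen dates start at 0 automatically
with open(r"D:\test_stack.csv", newline="") as f:
    for row in csv.reader(f):
        if row[0] == "merrytan":
            totals[row[1]] += int(row[2])  # running sum of energy per date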
I would turn your CSV file into a two-level dictionary, with username and then date as the keys
infile = open("data.csv", "r").readlines()
savings = dict()
# Skip the first line of the CSV, since that has the column names,
# not data
for row in infile[1:]:
    username, date_col, saved = row.strip().split(",")
    saved = int(saved)
    if username in savings:
        if date_col in savings[username]:
            savings[username][date_col] = savings[username][date_col] + saved
        else:
            savings[username][date_col] = saved
    else:
        savings[username] = {date_col: saved}
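With that structure in place, the per-date totals for one user come from a plain dictionary lookup; a small usage sketch (the username is taken from the question):

# Print each date's aggregated total for one user
for date_col, total in savings["merrytan"].items():
    print(f"Date: {date_col} Total Energy saved = {total}")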

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the names of each column:
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.); there are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and put these names in the column names. I'm new to Python and I'm not very sure how to do it:
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    # read the file line by line into a growing dataframe
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])],
                         ignore_index=True)
# drop the 29 header rows that precede the data
data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
# drop the first three and last three columns, which are not useful
data.drop(data.columns[[0, 1, 2, x - 1, x - 2, x - 3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data: I'm dropping all the columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names shown in the first piece of code. As I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: Is it possible for it to work within the other "open" loop that I have?
Assume Column Names from Row 2 up to 6, Data from Row 7 up to EOF.
For instance (untested code)
import pandas as pd

rows = []
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            # header line: first two fields are the channel and its name
            ch, name = line.split(',')[:2]
            columns.append(name)
        elif row > 6:
            # data line: split into fields (assumed to match the name count)
            rows.append(tuple(line.strip().split(',')))
data = pd.DataFrame(rows, columns=columns)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
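An alternative sketch that leans on pandas itself: harvest the quoted channel names from the header block, then let pd.read_csv parse the data block. The row offsets below (names in rows 3-18 for the 16 channels, 29 header rows before the data, channel values starting in the fourth column) are assumptions read off the sample and the manual drops above; adjust them to the real layout:

import pandas as pd

path = r'path-to-file.csv'

# Collect the 16 channel names from the header block (assumed rows 3-18)
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 3 <= row <= 18:
            columns.append(line.split(',')[1].strip().strip('"'))
        elif row > 18:
            break

# Parse the data block and keep only the channel columns, mirroring the
# manual drops done earlier
df = pd.read_csv(path, skiprows=29, header=None)
df = df.iloc[:, 3:3 + len(columns)]
df.columns = columns
df = df.apply(pd.to_numeric)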

How to append a dictionary to a pandas dataframe?

I have a set of urls containing JSON files and an empty pandas dataframe with columns representing the attributes of the JSON files. Not all JSON files have all the attributes of the dataframe. What I need to do is create dictionaries out of the JSON files and then append each dictionary to the pandas dataframe as a new row; where a JSON file doesn't have an attribute matching a column in the dataframe, that cell has to be left blank.
I managed to create dictionaries as:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)], row] = data[str(key)]
            else:
                df.loc[[str(column)], row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
For me below code works:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row, key] = data[key]
You had the order of the arguments in .loc mixed up, and one pair of [] too many.
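The reason the corrected line works is that .loc assignment enlarges the frame on demand, creating missing rows and columns and padding everything else with NaN. A tiny self-contained sketch (modern pandas, invented column names):

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
df.loc[0, 'a'] = 1  # row 0 is created; 'b' is padded with NaN
df.loc[1, 'c'] = 2  # row 1 and column 'c' are added on the fly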
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use pandas.append
for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = {k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems()}
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a new frame, so reassign
Not too sure why your code won't work, but consider the following few edits which should clean things up, should you still want to use it:
for row, url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.ix[row, key] = val
I used enumerate to iterate over the index and value of the links array; this way you don't need an index counter (row in your code). Then I used the .items dictionary method, so I can iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
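For what it's worth, on current Python and pandas (where urllib2, .iteritems(), .ix and DataFrame.append are all gone), the task reduces to collecting a list of dicts and handing it to the DataFrame constructor, which fills missing attributes with NaN; a sketch assuming links holds the same URLs as above:

import json
from urllib.request import urlopen

import pandas as pd

rows = []
for url in links:
    with urlopen(url) as resp:
        rows.append(json.loads(resp.read()))

# Missing keys become NaN automatically; reindex restricts the result to
# the columns the original dataframe expects
df = pd.DataFrame(rows).reindex(columns=df.columns)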
