Creating dataset from loops - python

I created a small dataset, which you can find below:
Later I formed groups using the CIQ column (using the pandas groupby syntax):
Entire code:
import pandas as pd

fd = pd.read_csv("C:....\Test.csv")
coder_gr = fd.groupby(["CIQ"])
print(coder_gr.first())
for x, y in coder_gr:
    y.Date.duplicated()
I checked for duplicates inside each group using the for loop above.
Now I want to output each entire group together with the output of the duplicate check. For that I tried the code below:
emp = []
for x, y in coder_gr:
    emp.append(y)
    emp.append(y.Date.duplicated())
The output looks like:
Desired output:
I am not getting the output in the proper format, and I don't know how to set it up.

Try this:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    # option_context only takes effect inside the with block
    for x, y in coder_gr:
        print(y)
        print(y.Date.duplicated())

Finally I got the answer:
emp = pd.DataFrame()
for x, y in coder_gr:
    # pd.series was a typo for pd.Series, and DataFrame.append was removed
    # in pandas 2.0; pd.concat does the same job
    emp = pd.concat([emp, y], ignore_index=True)
    emp = pd.concat([emp, y.Date.duplicated().to_frame()], ignore_index=True)
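A cleaner pattern, as a sketch (assuming the same coder_gr groupby as above), is to record the duplicate check as a column so each group's rows and flags stay aligned, then concatenate once:

import pandas as pd

pieces = []
for x, y in coder_gr:
    y = y.copy()
    # the duplicate flags sit next to the rows they describe
    y["Date_duplicated"] = y.Date.duplicated()
    pieces.append(y)
emp = pd.concat(pieces, ignore_index=True)
print(emp)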


get the full row data from the found extrema

I am new to pandas and I can't find a way to get the full rows of the found extrema.
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

df = pd.read_csv('test.csv')
df['min'] = df.iloc[argrelextrema(df.Close.values, np.less_equal, order=10)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df.Close.values, np.greater_equal, order=10)[0]]['Close']
# create lists for `min` and `max`
min_values_list = df['min'].dropna().tolist()
max_values_list = df['max'].dropna().tolist()
print(min_values_list, max_values_list)
It prints only the minima and maxima values, but I need the full row data of the found minima / maxima.
Example of data:
Datetime,Date,Open,High,Low,Close
2021-01-11 00:00:00+00:00,18638.0,1.2189176082611084,1.2199585437774658,1.2186205387115479,1.2192147970199585
If the list is required, then I would suggest:
def df_to_list_rowwise(df: pd.DataFrame) -> list:
    return [df.iloc[i, :].tolist() for i in range(df.shape[0])]
df_min_values = df.iloc[argrelextrema(np.array(df.Close), np.less_equal)[0], :]
df_max_values = df.iloc[argrelextrema(np.array(df.Close), np.greater_equal)[0], :]
print(df_to_list_rowwise(df_min_values))
print(df_to_list_rowwise(df_max_values))
Would that help?
Try df.dropna().index.tolist() instead of specifying the column: adding the column name returns just the value at a specific row and column, not the whole row.
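Building on that comment, a minimal sketch (assuming the df['min'] / df['max'] columns created in the question): the index of the non-NaN entries identifies the extrema rows, and df.loc with that index returns the complete rows.

import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

df = pd.read_csv('test.csv')
df['min'] = df.iloc[argrelextrema(df.Close.values, np.less_equal, order=10)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df.Close.values, np.greater_equal, order=10)[0]]['Close']

# pull the full rows back out of df via the surviving index
min_rows = df.loc[df['min'].dropna().index]
max_rows = df.loc[df['max'].dropna().index]
print(min_rows)
print(max_rows)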

Adding empty rows in Pandas dataframe

I'd like to insert empty rows at regular intervals into my dataframe.
I have the following code, which does what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', index=data_only_trades.columns)
# DataFrame.append was removed in pandas 2.0; concat the blank row instead
f = lambda d: pd.concat([d, s.to_frame().T], ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so it adds two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5 -- do I have to use .loc in arange then)?
I also tried this code, but I struggled to set the starting row and to make the values blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)  # note: DataFrame.append was removed in pandas 2.0
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
import numpy as np
# np.full builds the block of empty strings directly
v = np.full((numberOfRowsYouWant, df.shape[1]), "", dtype=object)
pd.DataFrame(np.vstack((df.values, v)), columns=df.columns)
I think you can use NumPy.
But if you want to keep your approach, simply convert NaN to "":
df.fillna("")
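To address both follow-up questions directly, here is a minimal sketch (the helper name and its every/n_blank/start parameters are assumptions to adjust): rows before the starting point form one chunk, a new chunk opens every few rows after that, and a block of blank rows is concatenated after each chunk.

import numpy as np
import pandas as pd

def insert_blank_rows(df, every=4, n_blank=2, start=5):
    # hypothetical helper: a block of n_blank empty-string rows
    blank = pd.DataFrame([[""] * df.shape[1]] * n_blank, columns=df.columns)
    # rows before `start` all fall into group -1; afterwards a new group
    # opens every `every` rows, so the blanks land between the chunks
    groups = np.maximum((np.arange(len(df)) - start) // every, -1)
    pieces = []
    for _, chunk in df.groupby(groups, sort=True):
        pieces.extend([chunk, blank])
    return pd.concat(pieces, ignore_index=True)

out = insert_blank_rows(data_only_trades, every=4, n_blank=2, start=5)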

Python: How to parse variables from several pandas dataframes?

I want to extract the X and y variables from several pandas dataframes (before proceeding to the next steps). I read in the tab-delimited .txt file before I extract the information.
Error raised is ValueError: too many values to unpack (expected 2).
import pandas as pd

class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")
        X, y = self.df.iloc[1:, 1:]
        return X, y

dp_linear_cna = DataProcessing("data_linear_cna.txt")
dp_mrna_seq_v2_rsem = DataProcessing("data_mrna_seq_v2_rsem.txt")

dp_linear_cna.extract_info()
dp_mrna_seq_v2_rsem.extract_info()
Traceback:
ValueError: too many values to unpack (expected 2)
The sep="/t" is supposed to be sep="\t".
Never iterate over rows/columns; select data using the index.
e.g. selecting a column: df['some_column_name']
Your coding style is quite bad. First of all, don't return anything from __init__: it's a constructor. Make another function instead.
class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")

    def split_data(self):
        X = self.df.iloc[:, :-1]
        y = self.df.iloc[:, -1]
        return X, y
Call your DataProcessing class like this:
def main():
    dp = DataProcessing('data_linear_cna.txt')
    X, y = dp.split_data()
    print(X)
    print()
    print(y)
The main point here is selection by position via df.iloc[rows, columns].
X, y = self.df.iloc[1:, 1:]
This is not a valid statement: pandas.DataFrame.iloc returns another pandas.DataFrame, not a tuple, so the unpacking fails (iterating a DataFrame yields its column labels, and here there are more than two columns, hence "too many values to unpack").
Indexing both axes
You can mix the indexer types for the index and columns. Use : to select the entire axis.
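As a tiny illustration of that rule (hypothetical three-column frame): a slice on the column axis returns a DataFrame, while a single position returns a Series.

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
print(df.iloc[:, :-1])  # all rows, all but the last column -> DataFrame
print(df.iloc[:, -1])   # all rows, the last column only -> Series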

Problem importing data from Excel to Treeview

I created a treeview to display the data imported from an Excel file.
def afficher():
    fichier = r"*.xlsx"
    df = pd.read_excel(fichier)
    for row in df:
        refOF = row['refOF']
        refP = row['refP']
        refC = row['refC']
        nbreP = row['nbreP']
        dateFF = row['dateFF']
        self.ordreF.insert("", 0, values=(refOF, refP, refC, nbreP, dateFF))
but I encounter the following error:
refOF = row['refOF']
TypeError: string indices must be integers
Please tell me how I can solve this problem.
Another way is to replace the original for loop with the following:
for tup in df[['refOF', 'refP', 'refC', 'nbreP', 'dateFF']].itertuples(index=False, name=None):
    self.ordreF.insert("", 0, values=tup)
This works because df.itertuples(index=False, name=None) yields plain tuples, without the index, in the listed column order. Each tuple can be fed into the values= argument directly.
With your loop you are actually not iterating over the rows but over the column names. That is the reason for the error message: row is a string holding the column name, and indexing a string with [] requires an integer or an integer-based slice, not a string.
To make your code work, you just need to modify it a bit to iterate over the rows:
def afficher():
    fichier = r"*.xlsx"
    df = pd.read_excel(fichier)
    for idx, row in df.iterrows():
        refOF = row['refOF']
        refP = row['refP']
        refC = row['refC']
        nbreP = row['nbreP']
        dateFF = row['dateFF']
        self.ordreF.insert("", 0, values=(refOF, refP, refC, nbreP, dateFF))
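A quick way to see the behavior described above (hypothetical two-column frame): iterating a DataFrame directly yields its column names, which is why row['refOF'] fails on a string.

import pandas as pd

df = pd.DataFrame({"refOF": ["A1"], "refP": ["B2"]})
for row in df:
    print(row)  # prints the column names "refOF", then "refP"
# row is a plain string here, so row['refOF'] raises
# "TypeError: string indices must be integers"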

Pandas, assign object name using index strings

Using sample data:
Product = [Galaxy_8, Galaxy_Note_9, Galaxy_Note_10, Galaxy_11]
I would like to create 4 data frames, each containing the respective product's sales information.
The problem is that I would like to use the index strings to name the data frames. For instance, the expected output is:
Galaxy_8 = pd.DataFrame()
Galaxy_Note_9 = pd.DataFrame()
Galaxy_Note_10 = pd.DataFrame()
Galaxy_11 = pd.DataFrame()
Imagine the product list growing beyond 200 entries: what is the most efficient way to achieve the desired outcome?
Thank you
If the sample list is like:
Product = ['Galaxy_8', 'Galaxy_Note_9', 'Galaxy_Note_10', 'Galaxy_11']
then you can try:
for var in Product:
    globals()[var] = pd.DataFrame()
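Creating variables through globals() works, but a dictionary keyed by product name is the usual, safer pattern; it scales to hundreds of products without polluting the namespace. A minimal sketch:

import pandas as pd

Product = ['Galaxy_8', 'Galaxy_Note_9', 'Galaxy_Note_10', 'Galaxy_11']

# one empty DataFrame per product, addressable by name
frames = {name: pd.DataFrame() for name in Product}
print(frames['Galaxy_8'])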
