ValueError when trying to convert Dictionary to DataFrame Pandas - python

I have a dictionary like this:
{'6DEC19': 0.61, '13DEC19': 0.58, '27DEC19': 0.63, '31JAN20': 0.66, '27MAR20': 0.69, '26JUN20': 0.71}
I'm simply trying to turn this into a DataFrame with the columns being 6DEC19, 13DEC19, etc., and the index set to the current date and hour, for which I would use pd.Timestamp.now().floor('60min').
With the resulting df looking like this:
6DEC19 13DEC19 27DEC19 31JAN20 27MAR20 26JUN20
2019-12-04 20:00:00 0.61 0.58 0.63 0.66 0.69 0.71
My first step would just be to turn the dict into a DataFrame, and as far as I'm concerned this code should do the trick:
df = pd.DataFrame.from_dict(dict)
But I get this error message: ValueError: If using all scalar values, you must pass an index.
I really have no idea what the problem is here. Any suggestions would be great, and if anyone can fit the problem of setting the index into the bargain, so much the better. Cheers

As the error message says, you need to specify an index: with all scalar values, pandas cannot infer how many rows to build. You can pass the index directly, which also covers the timestamp you wanted:
import pandas as pd
d = {'6DEC19': 0.61, '13DEC19': 0.58, '27DEC19': 0.63, '31JAN20': 0.66, '27MAR20': 0.69, '26JUN20': 0.71}
df = pd.DataFrame(d, index=[pd.Timestamp.now().floor('60min')])
print(df)
Output
6DEC19 13DEC19 27DEC19 31JAN20 27MAR20 26JUN20
2019-12-04 17:00:00 0.61 0.58 0.63 0.66 0.69 0.71

try this: wrapping each value in a list gives pandas a column of length one, so no explicit index is needed:
import pandas as pd
a = {'6DEC19': [0.61], '13DEC19': [0.58], '27DEC19': [0.63], '31JAN20': [0.66], '27MAR20': [0.69], '26JUN20': [0.71]}
df = pd.DataFrame.from_dict(a)
print(df)
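If you also want the index from the question (the current date floored to the hour), you can assign it after building the frame; a minimal sketch reusing the dict a from above:
df = pd.DataFrame.from_dict(a)
df.index = [pd.Timestamp.now().floor('60min')]  # stamp the single row with the current hour
print(df)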

try this, though note it gives a long two-column (key, value) frame rather than the wide layout asked for:
newDF = pd.DataFrame(yourDictionary.items())
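To get from there to the wide layout, one option is to pivot; a sketch building on the same yourDictionary ('expiry' and 'value' are just placeholder column names):
wide = pd.DataFrame(yourDictionary.items(), columns=['expiry', 'value']).set_index('expiry').T
wide.index = [pd.Timestamp.now().floor('60min')]  # stamp the single row with the current hour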

Related

Remove duplicate row in array based on specific column values in Python

I have an array like this:
array = [[0.91 0.33 0.09]
[0.52 0.63 0.05]
[0.91 0.33 0.11]
[0.52 0.63 0.07]
[0.62 0.41 0.01]
[0.36 0.37 0.01]]
I need it to remove the row with the larger value in the third column if the first two column values are duplicate. So this:
array2 = [[0.91 0.33 0.11]
[0.52 0.63 0.07]
[0.62 0.41 0.01]
[0.36 0.37 0.01]]
I want a pythonic way to do this without for loops if possible.
Two quick, common ways:
Python has a built-in filter() function (itertools has filterfalse() as well)
Use a list comprehension
Since you have the condition "if the first two column values are duplicate", you'll have to do some grouping first, and itertools also has groupby(); a short sketch follows below.
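For example, a minimal pandas sketch of the grouping idea (the column names a/b/c are just placeholders; note the question's text says to drop the larger third-column value while its sample output keeps it, so switch between max and min as needed):
import numpy as np
import pandas as pd
arr = np.array([[0.91, 0.33, 0.09], [0.52, 0.63, 0.05], [0.91, 0.33, 0.11],
                [0.52, 0.63, 0.07], [0.62, 0.41, 0.01], [0.36, 0.37, 0.01]])
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])
# keep one row per (a, b) pair, taking the max of column c; use .min() to keep the smaller value
result = df.groupby(['a', 'b'], as_index=False, sort=False)['c'].max().to_numpy()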

Winsorize dataframe columns per month while ignoring NaN's

I have a dataframe with monthly data and the following columns: date, bm and cash
date bm cash
1981-09-30 0.210308 2.487146
1981-10-31 0.241291 2.897529
1981-11-30 0.221529 2.892758
1981-12-31 0.239002 2.726372
1981-09-30 0.834520 4.387087
1981-10-31 0.800472 4.297658
1981-11-30 0.815778 4.459382
1981-12-31 0.836681 4.895269
Now I want to winsorize my data per month while keeping NaN values in the data. I.e., I want to group the data per month and overwrite observations above the 0.99 percentile and below the 0.01 percentile with the 0.99 and 0.01 percentile values respectively. From Winsorizing data by column in pandas with NaN I found that I should do this with the "clip" function. My code looks as follows:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
I get the following error: ValueError: cannot reindex from a duplicate axis
P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.
Edit: The solution from @Allolz is already a great help, but it does not work exactly as it is supposed to. Before I ran the code from @Allolz, I ran:
df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which returned:
date
1980-01-31 0.00 1.580564e+00
0.01 1.599805e+00
0.25 2.388106e+00
0.50 6.427071e+00
0.75 1.200685e+01
0.99 5.133111e+01
1.00 5.530329e+01
After winsorizing I get:
date
1980-01-31 0.00 1.599805
0.01 1.617123
0.25 2.388106
0.50 6.427071
0.75 12.006854
0.99 47.756152
1.00 51.331114
It is clear that the new 0.0 and 1.0 quantiles are equal to the original 0.01 and 0.99 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles, whereas I would expect these to remain the same. What can cause this and what could solve it? My hunch is that it might have to do with NaN's in the data, but I'm not sure if that is really the cause.
One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast the 0.01 and 0.99 quantiles (for that month's group) back to the DataFrame, and then you can use those Series to clip the original all at once (clip leaves NaN alone, so it satisfies that requirement too). Then, if you want, remove the helper columns (I'll leave them in for clarity).
Sample Data
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})
Code
gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']
# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)
# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])
Output
The majority of rows will not be changed (only those outside the 1st-99th percentile range), so we can check the small subset of rows that did change to see that it works. You can see that rows for the same month have the same bounds and the winsorized value ('val_wins') is properly clipped to the bound it exceeded.
df[df['val'] != df['val_wins']]
# date val lb ub val_wins
#42 2010-09-01 -1.686566 -1.125862 3.206333 -1.125862
#96 2010-04-01 -1.255322 -1.243975 2.995711 -1.243975
#165 2010-08-01 3.367880 -1.020273 3.332030 3.332030
#172 2010-09-01 -1.813011 -1.125862 3.206333 -1.125862
#398 2010-09-01 3.281198 -1.125862 3.206333 3.206333
#... ... ... ... ... ...
#9626 2010-12-01 3.626950 -1.198967 3.249161 3.249161
#9746 2010-11-01 3.472490 -1.259557 3.261329 3.261329
#9762 2010-09-01 3.460467 -1.125862 3.206333 3.206333
#9768 2010-06-01 -1.625013 -1.482529 3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967 3.249161 -1.198967
#
#[214 rows x 5 columns]
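To map this back onto the question's frame, a sketch assuming the original data sits in df_in with a regular 'date' column and the value columns 'bm' and 'cash':
gp = df_in.groupby(pd.Grouper(freq='M', key='date'))
for c in ['bm', 'cash']:
    lb = gp[c].transform('quantile', 0.01)  # per-month lower bound
    ub = gp[c].transform('quantile', 0.99)  # per-month upper bound
    df_in[c] = df_in[c].clip(lower=lb, upper=ub)  # NaNs are left untouched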

Python can't write to dataframe from dict. Simple

dict1 = {'0-10': -0.04,
'10-20': -0.01,
'20-30': -0.03,
'30-40': -0.04,
'40-50': -0.02,
'50-60': 0.01,
'60-70': 0.05,
'70-80': 0.01,
'80-90': 0.09,
'90-100': 0.04}
stat = pd.DataFrame()
for x, y in dict1.items():
    stat[x] = y
I'm trying to write the dict values to my DataFrame and associate the column names with the keys. But my output is this:
Empty DataFrame
Columns: [0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100]
Index: []
Tried it multiple times. No syntax errors. What am I missing? Thanks.
Assigning a scalar to a column of an empty DataFrame leaves it empty, because there are no index rows to broadcast the value into. Build the frame in one call instead:
df = pd.DataFrame(dict1, index=[0])
or
df = pd.DataFrame([dict1])
print(df)
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100
0 -0.04 -0.01 -0.03 -0.04 -0.02 0.01 0.05 0.01 0.09 0.04
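If you do want to keep the loop, seeding the frame with a one-row index makes the scalar assignments work; a small sketch:
stat = pd.DataFrame(index=[0])  # one row for the scalars to land in
for x, y in dict1.items():
    stat[x] = y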

Change column names in Pandas Dataframe from a list

Is it possible to change column names using data in a list?
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns and I don't want to do it manually for each column.
I was trying the following using df.rename but keep getting errors. Thanks!
You can use this :
df.columns = New_Labels
Using rename is a more formally correct approach. You just have to provide a dictionary that maps your current column names to the new ones (which guarantees the expected result even if the columns are out of order):
new_names = {'A': 'NaU', 'B': 'MgU', 'C': 'AlU', 'D': 'SiU'}
df = df.rename(columns=new_names)
Notice you can provide entries only for the names you want to substitute; the rest will remain the same.
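With 100 columns you don't have to type the mapping by hand; a sketch that builds it from the list, assuming New_Labels is in the same order as df.columns:
df = df.rename(columns=dict(zip(df.columns, New_Labels)))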
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
this will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care with the order of the new column names.
The accepted rename answer is fine, but it's mainly for mapping old→new names. If we just want to wipe out the column names with a new list, there's no need to create an intermediate mapping dictionary. Just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.

How to calculate average of numbers from multiple csv files?

I have files like the following as replicates from a simulation experiment I've been doing:
generation, ratio_of_player_A, ratio_of_player_B, ratio_of_player_C
So, the data is something like
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10
etc
Now, since I run this in multiples, I have around ~1000 files for each experiment, giving various such numbers. My problem is to average them all for one set of experiments.
Thus, I would like to have a file that contains the average ratio after each generation (averaged over the multiple replicates, i.e. files).
All the replicate output files which need to be averaged are named like output1.csv, output2.csv, output3.csv ... output1000.csv
I'd be obliged if someone could help me out with a shell script or a Python script.
If I understood well, let's say you have two files like these:
$ cat file1
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10
$ cat file2
0, 0.99, 1, 0.02
1, 0.10, 0.90, 0.90
2, 0.30, 0.10, 0.30
And you want to take the mean of the corresponding columns across both files.
Edit: I found a better way, using pd.concat:
all_files = pd.concat([file1,file2]) # you can easily put your 1000 files here
result = {}
for i in range(3):  # 3 being the number of generations
    result[i] = all_files[i::3].mean()
result_df = pd.DataFrame(result)
result_df
0 1 2
ratio_of_player_A 0.660 0.25 0.40
ratio_of_player_B 0.665 0.65 0.25
ratio_of_player_C 0.175 0.55 0.20
Another way is with merge, but one needs to perform multiple merges:
import pandas as pd
In [1]: names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
In [2]: file1 = pd.read_csv("file1", index_col=0, names=names)
In [3]: file2 = pd.read_csv("file2", index_col=0, names=names)
In [3]: file1
Out[3]:
ratio_of_player_A ratio_of_player_B ratio_of_player_C
generation
0 0.33 0.33 0.33
1 0.40 0.40 0.20
2 0.50 0.40 0.10
In [4]: file2
Out[4]:
ratio_of_player_A ratio_of_player_B ratio_of_player_C
generation
0 0.99 1.0 0.02
1 0.10 0.9 0.90
2 0.30 0.1 0.30
In [5]: merged_file = file1.merge(file2, right_index=True, left_index=True, suffixes=["_1","_2"])
In [6]: merged_file.filter(regex="ratio_of_player_A_*").mean(axis=1)
Out[6]:
generation
0 0.66
1 0.25
2 0.40
dtype: float64
Or this way (a bit faster, I guess):
merged_file.iloc[:, ::3].mean(axis=1)  # player A
You can merge repeatedly before applying the mean() method if you have more than two files.
If I misunderstood the question, please show us what you expect from file1 and file2.
Ask if there is something you don't understand.
Hope this helps !
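To scale the same concat idea to all ~1000 replicate files, a sketch (assuming the files are header-less and share the layout shown in the question; adjust read_csv if they contain a header row):
import glob
import pandas as pd
names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
files = sorted(glob.glob("output*.csv"))
frames = [pd.read_csv(f, index_col=0, names=names) for f in files]
averaged = pd.concat(frames).groupby(level=0).mean()  # mean per generation across replicates
averaged.to_csv("averaged.csv")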
The following should work:
from numpy import genfromtxt
files = ["file1", "file2", ...]
data = genfromtxt(files[0], delimiter=',')
for f in files[1:]:
    data += genfromtxt(f, delimiter=',')
data /= len(files)
You can load each of the 1000 experiments into a DataFrame, sum them all, then calculate the mean.
import tkinter.filedialog
import pandas as pd

filepath = tkinter.filedialog.askopenfilenames(filetypes=[('CSV', '*.csv')])  # select your files
dfs = []
for file in filepath:
    df = pd.read_csv(file, sep=';', decimal=',')
    dfs.append(df)
temp = dfs[0]  # temporary variable to hold the running sum
for i in range(1, len(dfs)):  # starts from 1 because index 0 is already in temp
    temp = temp + dfs[i]
result = temp / len(dfs)
Your problem is not very clear, but if I understand it right:
> temp
for i in `ls *.csv`; do
  cat "$i" >> temp
done
Then you have all the data from the different files in one big file. Try loading it into an SQLite database (1. create a table, 2. insert the data).
After that you can query your data like:
select sum(columns)/count(columns) from yourtablehavingtempdata etc.
Take a look at SQLite since your data is tabular; SQLite will be better suited in my opinion.
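A quick sketch of that idea driven from Python instead of the sqlite shell (assuming the concatenated temp file has no header lines; the table and column names here are just illustrative):
import sqlite3
import pandas as pd
con = sqlite3.connect('experiments.db')
cols = ['generation', 'ratio_a', 'ratio_b', 'ratio_c']
pd.read_csv('temp', names=cols).to_sql('runs', con, index=False, if_exists='replace')
avg = pd.read_sql('select generation, avg(ratio_a), avg(ratio_b), avg(ratio_c) '
                  'from runs group by generation', con)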
