How to create a datetime indexed pandas DataFrame with hypothesis library? - python

I am trying to create a pandas DataFrame with the hypothesis library for code testing purposes, using the following code:
from hypothesis import given
from hypothesis.extra.pandas import columns, data_frames
from hypothesis.extra.numpy import datetime64_dtypes
@given(data_frames(index=datetime64_dtypes(max_period='Y', min_period='s'),
                   columns=columns("A B C".split(), dtype=int)))
The error I receive is the following:
E TypeError: 'numpy.dtype' object is not iterable
I suspect this is because, for index=, I am passing only a single datetime element rather than, for example, a pd.Series whose values are all datetimes. Even if that is the cause (I am not sure), I still don't know how to use the hypothesis library to achieve my goal.
Can anyone tell me what's wrong with the code and what the solution would be?

The error occurs because data_frames expects its index= argument to be a strategy that generates pandas indexes, such as indexes(...). datetime64_dtypes() is a strategy that generates NumPy dtype objects, not indexes, so hypothesis ends up trying to iterate over a numpy.dtype.
To fix this, build the index with indexes() and pass the element strategy to it, like so:
from hypothesis import given, strategies
from hypothesis.extra.pandas import columns, data_frames, indexes
@given(data_frames(index=indexes(elements=strategies.datetimes()),
                   columns=columns("A B C".split(), dtype=int)))
Note that in order to get datetime we use datetimes().

Related

Pandas function to add field to dataframe does not work

I have some code which I want to use in a dynamic python function. This code adds a field to an existing dataframe and does some adjustments to it. However, I got the error "TypeError: string indices must be integers". What am I doing incorrectly?
See below the function plus the code for calling the function.
import pandas as pd
#function
def create_new_date_column_in_df_based_on_other_date_string_column(df,df_field_existing_str,df_field_new):
    df[df_field_new] = df[df_field_existing_str]
    df[df_field_new] = df[df_field_existing_str].str.replace('12:00:00 AM','')
    df[df_field_new] = df[df_field_existing_str].str.strip()
    df[df_field_new] = pd.to_datetime(df[df_field_existing_str]).dt.strftime('%m-%d-%Y')
    return df[df_field_new]
#calling the function
create_new_date_column_in_df_based_on_other_date_string_column(df='my_df1',df_field_existing_str='existingfieldname',df_field_new='newfieldname')
The df argument you are passing to the function is a str ('my_df1'), and so is df_field_existing_str.
Inside the function, df[df_field_existing_str] therefore tries to index one string with another string using [] (that is, str.__getitem__()), which is not possible: strings can only be indexed with integers or slices.
You are not using a DataFrame here, only strings, which is why you get this TypeError. Pass the DataFrame object itself (my_df1, without quotes) as df.
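The difference can be sketched as follows. This is only an illustration: the sample values in my_df1 and the shortened function name add_date_column are assumptions made for the example, not part of the question. The sketch also chains the cleaning steps so that each one builds on the previous result.

```python
import pandas as pd

# Indexing a string with a string reproduces the error from the question.
df_name = "my_df1"
try:
    df_name["existingfieldname"]
except TypeError as exc:
    error_message = str(exc)  # "string indices must be integers..."

def add_date_column(df, existing_col, new_col):
    """Add new_col to df as an MM-DD-YYYY string built from existing_col."""
    cleaned = (
        df[existing_col]
        .str.replace("12:00:00 AM", "", regex=False)  # drop the midnight time part
        .str.strip()                                  # remove leftover whitespace
    )
    df[new_col] = pd.to_datetime(cleaned).dt.strftime("%m-%d-%Y")
    return df[new_col]

# Made-up sample data standing in for the real DataFrame.
my_df1 = pd.DataFrame(
    {"existingfieldname": ["1/2/2020 12:00:00 AM", "3/15/2021 12:00:00 AM"]}
)

# Pass the DataFrame object, not its name as a string.
result = add_date_column(my_df1, "existingfieldname", "newfieldname")
print(result.tolist())
```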

Inconsistency between pandas.DataFrame.plot(kind='box') and pandas.DataFrame.boxplot()

I have encountered the following problem when trying to make a boxplot of one column in a pandas.DataFrame vs another one. Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(60))
df.columns = ['Values']
names = ('one','two','three')*int(df.shape[0]/3)
df['Names'] = names
df.plot(x='Names', y='Values', kind='box')
df.boxplot(column='Values', by='Names')
I expected the two plots to be the same, but they are not.
Is this expected behavior, and if so, how should the expression for the first plot be changed to match the second one?
.boxplot() and .plot(kind='box')/.plot.box() are separate implementations. The problem with .plot(kind='box')/.plot.box() is that although the by argument exists, it is not implemented and is therefore ignored (see this issue for example, and they never managed to document it properly), meaning that you won't be able to reproduce the result you get with .boxplot().
Tl;dr: .plot(kind='box')/.plot.box() is poorly implemented; use .boxplot() instead.
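If you still want to go through .plot(kind='box'), one possible workaround is to reshape the data first so that each group becomes its own column, since .plot(kind='box') draws one box per column. The sketch below is an assumption about how one might do that, not something taken from the answer; it uses a seeded random generator so the data is reproducible, and the plotting call is left as a comment so the snippet runs without a display backend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "Values": rng.random(60),
        "Names": ["one", "two", "three"] * 20,
    }
)

# One column per name; each row keeps its original index, so the cells
# belonging to the other two groups are NaN.
wide = df.pivot(columns="Names", values="Values")
print(wide.shape)

# wide.plot(kind="box")  # expected to draw one box per column (per name)
```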

NameError: name 'mean' is not defined

Calculating the basic statistics, I get the following working well:
import pandas as pd
max(df["Price"])
min(df["Price"])
But, this is returning an error:
mean(df["Price"])
NameError: name 'mean' is not defined
I'm just trying to understand the logic of this.
This one works well:
df["Price"].mean()
What kind of statistics work after the dot and which ones must wrap the column?
min() and max() are functions provided as Python built-ins.
You can use them on any iterable, which includes Pandas series, which is why what you're doing works.
Pandas also provides .min() and .max() as methods on series and dataframes, so e.g. df["Price"].min() would also work. The full list of Series functions is here; the full list of DataFrame functions is here.
If you do want to use a free function called mean(), e.g. when you have something that's not a Pandas series and you don't want to convert it to one, one actually does exist in the Python standard library, but you will have to import it:
from statistics import mean
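As a small illustration of the difference between built-ins and imported functions, here is a sketch using a plain Python list; the prices values are made up for the example.

```python
from statistics import mean

prices = [10.0, 12.0, 14.0]  # made-up values, not from the question

highest = max(prices)   # built-in, no import needed
lowest = min(prices)    # built-in, no import needed
average = mean(prices)  # only available after the import above

print(highest, lowest, average)  # 14.0 10.0 12.0
```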

How to get data from object in Python

I want to get the discord.user_id, I am VERY new to python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json('path_to_your/file.json')
The output will be a DataFrame, i.e. a table in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized for processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id with myObject.json_data.attributes.social_connections.discord.user_id
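If the data is available as raw JSON rather than as an object with attributes, the same path can be followed with plain dictionary lookups. The payload below is hypothetical: only the attributes, social_connections, discord, and user_id names come from the path above, and the values are invented for the sketch.

```python
import json

# Hypothetical payload; the real response may contain more fields.
raw = """
{
  "attributes": {
    "first_name": "Ada",
    "social_connections": {"discord": {"user_id": "1234"}}
  }
}
"""

pledge_data = json.loads(raw)
attributes = pledge_data["attributes"]

# Use .get() with fallbacks so a missing or null connection yields None
# instead of raising an exception.
connections = attributes.get("social_connections") or {}
discord = connections.get("discord") or {}
user_id = discord.get("user_id")

print(user_id)
```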

Python Pandas: Data Slices

I am stuck with an issue when it comes to taking slices of my data in python (I come from using Matlab).
So here is the code I'm using,
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
[Image: Output]
So I wish to take only certain rows (or, as I understand they are called in Python, slices) of this data matrix. The other problem I have is that the top row of my matrix is treated as the titles of the columns instead of as actual data points. So I have three questions:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because it is all numeric and all represents data.
2) I only need to take the 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so can I use pandas for that or do I need numpy?
So this is what I've tried,
data.iloc[0::6,:]
This gives me something like this:
[Image: Output2]
which is wrong because I don't need the values of 24.8 to be the 'title' but to be the first row of the new matrix.
I've also tried using np.array for this, but my problem is that when I try to use iloc, it says (which makes sense)
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!
To avoid loading the first record as the header, try using the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument; its value indicates which row of the data should be used as the header. It defaults to 0. Use None as the value for the header argument if none of the rows in your data functions as the header.
There are many useful arguments, all described in the documentation of the function.
This should also help with number 2.
Hope this helps.
Good luck!
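For questions 2 and 3, here is a minimal sketch of row selection and matrix multiplication. It assumes a small numeric DataFrame built in memory as a stand-in for the result of read_excel with header=None, since the real spreadsheet is not available here. Both readings of "the 6th row" are shown: every 6th row, and only the single 6th row.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('DataDMD.xlsx', header=None): 12 rows, 3 columns.
data = pd.DataFrame(np.arange(36).reshape(12, 3))

# Every 6th row, starting from the first one (row positions 0 and 6).
every_sixth = data.iloc[::6, :]

# Only the 6th row (position 5), kept as a one-row DataFrame.
sixth_row = data.iloc[[5], :]

# Convert to a NumPy array for matrix multiplication with the @ operator.
matrix = every_sixth.to_numpy()
product = matrix @ matrix.T

print(every_sixth)
print(sixth_row)
print(product.shape)
```

pandas is fine for loading and selecting the data; converting to a NumPy array with to_numpy() before the linear algebra keeps the later matrix operations straightforward.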
