Using Faker with Pandas (Python 3.5)

I have imported a csv into data frame and want to use faker to mask the First and Last names.
I am using the following code to call the function (list comprehension)
MasterDE['FirstName'] = [fake.first_name for i in range(MasterDE.FirstName.size)]
MasterDE['LastName'] = [fake.last_name for i in range(MasterDE.FirstName.size)]
When I try to inspect the data, I get this:
I would like to have the actual values and not this description. Will appreciate pointers on this.

Figured out what was wrong. I left out the parentheses, so the list held the method objects themselves instead of the names they return. Calling the method fixes it:
MasterDE['LastName'] = [fake.last_name() for i in range(MasterDE.LastName.size)]
This works fine.
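To make the difference concrete, here is a minimal sketch. NameSource is a hypothetical stand-in for a Faker instance so that the example runs without any third-party package; with the real library you would create the object with fake = Faker() after from faker import Faker. The row count is also an assumption.

```python
import random


class NameSource:
    """Hypothetical stand-in for a Faker instance, used only for illustration."""

    _last_names = ["Smith", "Jones", "Garcia"]

    def last_name(self):
        return random.choice(self._last_names)


fake = NameSource()
n_rows = 3  # stands in for MasterDE.LastName.size

# Without parentheses: each element is the bound method object itself.
not_called = [fake.last_name for _ in range(n_rows)]

# With parentheses: the method runs once per row and returns a string.
called = [fake.last_name() for _ in range(n_rows)]

print(callable(not_called[0]))
print(all(isinstance(name, str) for name in called))
```

Assigning a list like not_called to a DataFrame column is what produces the method descriptions seen in the question.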

Related

variable dataframe name - loop works by itself, but not inside of function

I have dataframes that follow name syntax of 'df#' and I would like to be able to loop through these dataframes in a function. In the code below, if function "testing" is removed, the loop works as expected. When I add the function, it gets stuck on the "test" variable with keyerror = "iris1".
import statistics
import seaborn as sns
iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')
def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.

Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns
# Keep the dataframes in a dictionary keyed by name
datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}
def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows
rows = testing(datasets)

No...
You should NEVER have to write a sentence like "I have dataframes that follow name syntax of 'df#'".
What you have then is a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list.
Then you can forget about vars(), trust me you don't need it... :)
EDIT:
And with a list comprehension, your code could fit in three statements:
import statistics
import seaborn as sns
list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]

Storing the dataframes in a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to pass an argument n specifying how many objects are in the list (I guess I could just add a bunch of if statements to define the list based on such an argument). EDIT: I am changing my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.

I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with eval:
test=eval('iris'+str(i+1))
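On why the original lookup fails inside the function: vars() with no argument behaves like locals(). At module level (a Jupyter cell, for example) the local namespace is the global one, so 'iris1' is found. Inside a function the local namespace holds only the function's own variables, hence the KeyError. A minimal sketch, using plain lists as stand-ins for the dataframes:

```python
iris1 = [5.1, 4.9]  # module-level stand-ins for the dataframes
iris2 = [6.3, 5.8]


def lookup_with_vars():
    # vars() here is the function's local namespace, which does not
    # contain the module-level name 'iris1'.
    try:
        return vars()["iris1"]
    except KeyError:
        return None


def lookup_with_globals():
    # globals() is the module namespace, where iris1 and iris2 live.
    return [globals()["iris" + str(i + 1)] for i in range(2)]


print(lookup_with_vars())
print(lookup_with_globals())
```

This only explains the behaviour; passing a list or dict of dataframes to the function, as suggested above, remains the cleaner design.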

How to get data from object in Python

I want to get the discord.user_id, I am VERY new to python and just need help getting this data.
I have tried everything and there is no clear answer online.
currently, this works to get a data point in the attributes section
pledge.relationship('patron').attribute('first_name')

You should try this :
import pandas as pd
df = pd.read_json('path_to_your/file.json')
The output will be a DataFrame, a two-dimensional table in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.

Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id

get/access each chunk of dask.dataframe(df, chunksize=100)

I used below code to split a dataframe using dask:
result=dd.from_pandas(df, chunksize=75)
I use below code to create a custom json file:
for z in result:
    createjson(z)
It just didn't work! How can I access each chunk?

There may be a more native way (feels like there should be) but you can do:
for i in range(result.npartitions):
    partition = result.get_partition(i)
    # partition is still a lazy dask DataFrame; partition.compute() returns it as a pandas DataFrame
    # your code here

We do not know what your createjson function does, but perhaps it is covered by to_json().
Alternatively, if you really want to do something unique to each of your partitions, and this is not specific to JSON, then you will want the method map_partitions().

Call a function over elements of a list in python

I am completely new to Python and pandas. I am trying to run a function, "get url" (called get_real in the code below), which returns the complete (expanded) URL for a short URL. I have a data frame that contains all the short URLs. I have tried two approaches. The first is a for loop that applies the function to every element and builds another series of expanded URLs, but I cannot get it to work and don't know why. I tried to write it like this:
for i in df2:
expanded(i) = get_real(df2[[i]])
print(expanded)df2.[i,'expanded']
next()
I also want it to resume with the next element on error, but I am not sure how to do that.
The second solution I tried was passing the whole array to the applymap function:
df4 = df3.applymap(get_real)
but this also doesn't work for me.
Thanks for all the help !

If the short URLs are a column in the pandas DataFrame, you can use the apply function (though I am not sure whether it would resume on error, most probably not).
Syntax -
df['<newcolumn>'] = df['<columnname>'].apply(<functionname>)
I am hoping all the short urls would be different rows in a single column.
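One way to make apply continue past failures is to wrap the function so that it returns a placeholder instead of raising. In this sketch get_real is a hypothetical stand-in for the asker's URL expander (it does no network access), and the column names and the choice of ValueError are assumptions:

```python
import pandas as pd


def get_real(short_url):
    """Hypothetical stand-in for the URL expander; raises on bad input."""
    if not short_url.startswith("http"):
        raise ValueError("not a URL: " + short_url)
    return short_url.replace("sho.rt", "example.com/long")


def safe_get_real(short_url):
    # Return None instead of stopping the whole apply() on one failure.
    try:
        return get_real(short_url)
    except ValueError:
        return None


df = pd.DataFrame({"short": ["http://sho.rt/a", "broken", "http://sho.rt/b"]})
df["expanded"] = df["short"].apply(safe_get_real)
print(df)
```

Rows that failed end up as missing values in the new column, so they can be found afterwards with df["expanded"].isna().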
If you want to use a for loop, then you can do something like -
for idx in df.index:
    try:
        df.loc[idx, '<newcolumn>'] = <functionname>(df.loc[idx, '<columnname>'])
    except <the error you want to catch, or Exception if you do not know>:
        <Do logic for handling the error>

I think the problem you are having is treating the dataframe just like a dictionary that has keys and values.
I think all you need to do is use
new_df = df2['expanded']
But you should show us what the df2 looks like

Python sas7bdat module usage

I have to dump data from SAS datasets. I found a Python module called sas7bdat.py that says it can read SAS .sas7bdat datasets, and I think it would be simpler and more straightforward to do the project in Python rather than SAS due to the other functionality required. However, the help(sas7bdat) in interactive Python is not very useful and the only example I was able to find to dump a dataset is as follows:
import sas7bdat
from sas7bdat import *
# following line is sas dataset to convert
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
#following line is txt file to create
foo.convertFile('/support/textfiles/locked_data.txt','\t')
This doesn't do what I want because a) it uses the SAS variable names as column headers and I need it to use the variable labels, and b) it uses "nan" to denote missing numeric values where I'd rather just leave the value blank.
Can anyone point me to some useful documentation on the methods included in sas7bdat.py? I've Googled every permutation of key words that I could think of, with no luck. If not, can someone give me an example or two of using readColumnAttributes(), readColumnLabels(), and/or readColumnNames()?
Thanks, all.

As time passes, solutions become easier. I think this one is easiest if you want to work with pandas:
import pandas as pd
df = pd.read_sas('/support/sas/locked_data.sas7bdat')
Note that it is easy to get a numpy array by using df.values
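As far as I know, read_sas returns the variable names as column names and does not expose the labels, so the two requests in the question (labels as headers, blanks for missing values) still need a step when writing the file. A rough sketch, where the small DataFrame stands in for the read_sas result and the labels mapping is hypothetical and would have to be supplied by hand or obtained elsewhere:

```python
import pandas as pd

# Stand-in for: df = pd.read_sas('/support/sas/locked_data.sas7bdat')
df = pd.DataFrame({"AGE": [34.0, float("nan")], "SEX": ["M", "F"]})

# Hypothetical mapping from SAS variable names to variable labels.
labels = {"AGE": "Age in years", "SEX": "Sex"}

# Rename the columns to the labels and write tab-separated text,
# leaving missing values blank rather than printing "nan".
text = df.rename(columns=labels).to_csv(sep="\t", na_rep="", index=False)
print(text)
```

Passing a file path as the first argument to to_csv writes the same content to disk instead of returning a string.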

This is only a partial answer as I've found no [easy to read] concrete documentation.
You can view the source code here
This shows some basic info regarding what arguments the methods require, such as:
readColumnAttributes(self, colattr)
readColumnLabels(self, collabs, coltext, colcount)
readColumnNames(self, colname, coltext)
I think most of what you are after is stored in the "header" class returned when creating an object with SAS7BDAT. If you just print that class you'll get a lot of info, but you can also access class attributes as well. I think most of what you may be looking for would be under foo.header.cols. I suspect you use various header attributes as parameters for the methods you mention.
Maybe something like this will get you closer?
from sas7bdat import SAS7BDAT
foo = SAS7BDAT(inFile)  # your file here...
for i in foo.header.cols:
    print('"Attributes"', i.attr)
    print('"Labels"', i.label)
    print('"Name"', i.name)
edit: Unrelated to this specific question, but the type() and dir() commands come in handy when trying to figure out what is going on in an unfamiliar class/library

I know I'm late with this answer, but in case someone searches for a similar question, the best option is:
from sas7bdat import SAS7BDAT
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
# This converts to dataframe:
ds = foo.to_data_frame()

Personally I think the better approach would be to export the data using SAS then process the external file as needed using Python.
In SAS, you can do this...
libname datalib "/support/sas";
filename sasdump "/support/textfiles/locked_data.txt";
proc export
    data = datalib.locked_data
    outfile = sasdump
    dbms = tab
    label
    replace;
run;
The downside to this is that while the column labels are used rather than the variable names, the labels are enclosed in double quotes. When processing in Python, you may need to programmatically remove them if they cause a problem. I hope that helps even though it doesn't use Python like you wanted.
