Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes, to be concatenated back together later, based on a list of bigrams I have (my reason for doing this is outside the scope of this question). The problem is that I want this code to run daily or weekly, and the manually created dataframes will no longer work if the data has changed after a refresh. For instance, looking at the code below, what if "data_science" is no longer a bigram pulled by my code next week, and I get another bigram such as "hello_world" that is not listed below? I need one function that handles all of these for me. I am making about 50 dataframes from my real data, so even without the automation, a function would be a huge time saver. One KEY point: I am grabbing all of these bigrams from a list and naming a dataframe after each one of them. That is what the list input in my function below is for.
```
data_science = df[df['column_name'].str.contains("data") &
                  df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"
p_value = df[df['column_name'].str.contains("p") &
             df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"
ab_testing = df[df['column_name'].str.contains("ab") &
                df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_testing)"
```
I am trying something like this code below but have not figured out how to make it work yet.
```
def df_creator(a, b, my_list):
    for a, b in my_list:
        a_b = df[df['Message_stop'].str.contains(a) &
                 df['Message_stop'].str.contains(b)]
        a_b['bigram'] = "(a_b)"
```
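
One possible way to get this working is to stop creating a separately named variable per bigram and instead collect the filtered dataframes in a dictionary keyed by the bigram name. The sketch below is only an illustration under assumptions: the function name `build_bigram_frames`, the sample data, and the choice of literal (non-regex) matching are all made up for the example, and the bigrams are assumed to arrive as a list of `(first, second)` tuples.

```python
import pandas as pd


def build_bigram_frames(df, bigrams, column="Message_stop"):
    """Return a dict mapping 'first_second' to the rows containing both words.

    Hypothetical helper: 'bigrams' is assumed to be a list of (first, second) tuples.
    """
    frames = {}
    for first, second in bigrams:
        # Literal substring matching; rows with missing text are treated as non-matches.
        mask = (df[column].str.contains(first, regex=False, na=False)
                & df[column].str.contains(second, regex=False, na=False))
        subset = df[mask].copy()  # copy so adding a column does not touch df
        subset["bigram"] = f"({first}_{second})"
        frames[f"{first}_{second}"] = subset
    return frames


# Made-up sample data, for illustration only.
df = pd.DataFrame({"Message_stop": ["data science rocks",
                                    "p value small",
                                    "ab testing data science",
                                    "nothing"]})
bigrams = [("data", "science"), ("p", "value"), ("ab", "testing")]

frames = build_bigram_frames(df, bigrams)
combined = pd.concat(frames.values(), ignore_index=True)
```

Because the dictionary is rebuilt from whatever bigram list is passed in, a bigram that disappears or appears after a refresh needs no code change; `frames["data_science"]` plays the role of the old `data_science` variable.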

Related

Working with .csv data as a Pandas DataFrame, getting redundancy error when applying logic

Been working on this project all day and it's destroying me. Currently have finished web scraping and have a final .csv which contains the elements of a pandas dataframe. Working with this dataframe in a new file, and currently have the following:
import pandas as pd

df = pd.read_csv('active_homes.csv')
for i in range(len(df)):
    add = df['Address'][i]
    price = df['Price'][i]
    if (price<100000) == True:
        print(price)
'active_homes.csv' looks like this:
Address,Status,Price,Meta
"387 8th St, Burlington, CO 80807",For Sale,169500,"4bed2bath1,560sqft"
and the resulting df's shape is (1764, 4).
This should, in theory, print the price for each iteration of price<100000.
In practice, it prints this:
I have confirmed that at each iteration of the above for loop, it is collecting the correct 'Price' and 'Address' information, and have also confirmed that at each interval the logic (price<100000) is working correctly. However, it is still doing the above. I was originally trying to just drop the rows of the dataframe that were <100000 but that wasn't doing anything. I was also trying to reassign the data to a new dataframe and it would either return an empty dataframe, or return a dataframe with duplicate data of this house (with the 'Price' of 58900).
So far, from all of that, I believe that the program is recognizing the amount of correct houses < 100000, but for some reason the assignment is sticking for the one address. It also does the same thing without assignment, as in:
for i in range(len(df)):
    if (df['Price'][i]<100000) == True:
        print(df['Price'][i])
Any help in identifying the error would be much appreciated.

With Pandas you generally try to avoid iterating over rows in the traditional Python way. Instead, you can get the desired result by filtering with a boolean condition:
import pandas as pd

df = pd.read_csv('active_homes.csv')
temp_df = df[df["Price"]<100000] # creating a new df isn't required, just a force of habit
print(temp_df["Price"]) # displays a Series of the houses below 100K; imo a prettier print
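
As a rough, self-contained illustration of the same boolean-mask idea, the sketch below reads a tiny in-memory CSV instead of 'active_homes.csv'. The second row is invented for the example (only its price of 58900 echoes the question), and the `pd.to_numeric` step is an assumption in case the scraped prices were stored as text. It also shows one way to drop the cheap rows, which is what the question originally attempted.

```python
import io

import pandas as pd

# Tiny stand-in for 'active_homes.csv'; the second row is made up for illustration.
csv_text = '''Address,Status,Price,Meta
"387 8th St, Burlington, CO 80807",For Sale,169500,"4bed2bath1,560sqft"
"1 Example Rd, Sampletown, CO 80000",For Sale,58900,"2bed1bath900sqft"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: coerce Price to a number in case it was scraped as text.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Rows priced under 100K, selected with a boolean mask instead of a loop.
cheap = df[df["Price"] < 100000]

# "Dropping" those rows means keeping the complement of the mask.
remaining = df[df["Price"] >= 100000].reset_index(drop=True)
```

Note that filtering returns a new dataframe; the original `df` is unchanged unless the result is assigned back to it.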

variable dataframe name - loop works by itself, but not inside of function

I have dataframes that follow name syntax of 'df#' and I would like to be able to loop through these dataframes in a function. In the code below, if function "testing" is removed, the loop works as expected. When I add the function, it gets stuck on the "test" variable with keyerror = "iris1".
import statistics
import seaborn as sns

iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')

def testing():
    rows = []
    for i in range(2):
        test=vars()['iris'+str(i+1)]
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.

Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns

datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}

def testing(data):
    rows = []
    for i in range(1,3):
        test=data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows

testing(datasets)

No...
You should never have to write a sentence like "I have dataframes that follow the name syntax 'df#'".
What you have is a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list.
Then you can forget about vars(); trust me, you don't need it... :)
EDIT:
And with a list comprehension, your code fits in a few lines:
import statistics
import seaborn as sns

list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]

Storing them as a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just pass an argument n specifying how many objects are in the list (I guess I could add a bunch of if statements to define the list based on such an argument). EDIT: I am changing my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.

I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with:
test=eval('iris'+str(i+1))
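
As for why the lookup works at the top level but fails inside the function: called without an argument, `vars()` behaves like `locals()`. At module level the local namespace is the global one, so `iris1` is found; inside a function it only contains that function's local variables, hence the KeyError. A minimal sketch of the difference, using plain lists in place of the iris dataframes (the function names are made up for the illustration):

```python
# Plain lists stand in for the dataframes; the names mirror the question.
iris1 = [5.1, 4.9]
iris2 = [6.3, 5.8]


def lookup_with_vars(name):
    # Inside a function, vars() only sees local names (here just 'name'),
    # so the module-level 'iris1' is not found and .get() returns None.
    return vars().get(name)


def lookup_with_globals(name):
    # globals() always refers to the module namespace, where 'iris1' lives.
    return globals().get(name)


missing = lookup_with_vars("iris1")
found = lookup_with_globals("iris1")
```

This also explains why `eval` works: it resolves names through both the local and the global scope. A list or dictionary of dataframes, as suggested above, avoids the name lookup altogether.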

Apply function to each element of multiple lists; return differently named dataframes

I have a function that returns specific country-currency pairs that are used in the following step.
The return is something like:
lst_dolar = ['USA_dolar','Canada_dolar','Australia_dolar']
lst_eur = ['France_euro','Germany_euro','Italy_euro']
lst_pound=['England_pound','Scotland_pound','Wales_pound']
I then use a function that returns a dataframe.
One of the parameters of this function is country-currency pair and the other is the period, from a list of periods:
period_lst = ['1y','2y','3y','4y','5y']
What I would like to do is to then get a list of dataframes, that will be then saved, each single one of them, to a different table, using SQLite3.
My question is how do I apply my function to each element of the lists of country-currency pairs and for each element of the period_lst and then obtain differently named dataframes as a result?
Ex: USA_dolar_1y
I would then like to be able to take each one of these dataframes and save it to a table in a database, with the table having the same name as the dataframe.
Thank you!

Whenever you think you need to dynamically name variables in Python, you probably want a dictionary:
def my_func(df, period):
    # do something with period and dataframe and return the result
    return df

period_lst = ['1y', '2y', '3y', '4y', '5y']
usa_dollar = {}
for p in period_lst:
    usa_dollar[p] = my_func(df, p)
You can then access the various resulting dataframes (or whatever your function returns) by their period:
use_data(usa_dollar['3y'])
By the way: don't use capitals in your variable names. Reserve CamelCase for class names, and write function names and variable names in lowercase, separated by underscores for readability. So, usa_dollar, not USAdollar, for example.
This helps editors spot problems in your code and makes the code easier to read for other programmers, as well as future you. Look up PEP8 for more of these style rules.
Another by the way: if the only reason you want to keep the resulting dataframes in separate variables is to then write them to a file, you could just write the dataframe to the file once you've created it, and reuse the variable for the next one, if you have no immediate other need for the data you're about to overwrite.
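
To cover every country-currency pair and every period, the dictionary idea can be extended by using the combined name (for example 'USA_dolar_1y') as the key, and that same key can then serve as the SQLite table name. This is a sketch under assumptions: `fetch_rates` is a hypothetical stand-in for the asker's own dataframe-returning function, and an in-memory database is used only so the example runs on its own.

```python
import sqlite3

import pandas as pd

lst_dolar = ['USA_dolar', 'Canada_dolar', 'Australia_dolar']
lst_eur = ['France_euro', 'Germany_euro', 'Italy_euro']
lst_pound = ['England_pound', 'Scotland_pound', 'Wales_pound']
period_lst = ['1y', '2y', '3y', '4y', '5y']


def fetch_rates(pair, period):
    # Hypothetical placeholder for the real function that returns a dataframe.
    return pd.DataFrame({"pair": [pair], "period": [period]})


# One dataframe per (pair, period), keyed by a name such as 'USA_dolar_1y'.
frames = {
    f"{pair}_{period}": fetch_rates(pair, period)
    for pair in lst_dolar + lst_eur + lst_pound
    for period in period_lst
}

# Write each dataframe to a table named after its key.
conn = sqlite3.connect(":memory:")  # assumption: swap in a file path for real use
for table_name, frame in frames.items():
    frame.to_sql(table_name, conn, if_exists="replace", index=False)

table_names = [row[0] for row in
               conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
```

A single dataframe is still reachable by name, e.g. `frames['USA_dolar_1y']`, without creating 45 separate variables.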

Pyspark - Saving & using previously calculated values

I have a dataset of thousands of files and I read / treat them with PySpark.
First, I've created functions like the following one to treat the whole dataset and this is working great.
from pyspark.sql import Window, functions as F

def get_volume_spark(data):
    days = lambda i: i * 86400  # seconds per day: 60 s * 60 min * 24 h
    partition = (Window.partitionBy("name")
                 .orderBy(F.col("date").cast("long"))
                 .rangeBetween(days(-31), days(0)))
    # COUNT_THRESHOLD is defined elsewhere in my code
    data = (data.withColumn("monthly_volume", F.count(F.col("op_id")).over(partition))
            .filter(F.col("monthly_volume") >= COUNT_THRESHOLD))
    return data
Every day new files arrive, and I want to process ONLY the new files and append the result to the output I created the first time, instead of reprocessing the whole, ever-growing dataset every day. That would take too long, and the operations have already been done on the older files.
The other thing is that here I split by month, for example (I calculate the count per month), but nothing guarantees that the new files will cover a whole month (they almost certainly won't). So I want to keep a counter or something similar so I can resume where I left off.
I would like to know whether there is some way to do that, or whether it is not possible at all.

How can I efficiently replace all instances of multiple regex patterns from a dataframe of strings, in pySpark?

I have a table in Hadoop which contains 7 billion strings, which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park', and I'd need to remove 'John' from that, ideally just replacing it with '[NAME]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option especially considering the number of calculations (20k names * 7bil input strings)
import re

# nameList contains ['John', 'Emma', etc.]
def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx, '[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)

df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!

In case you're still curious about this: because removeNames is plain Python code run through rdd.map, Spark has to serialize every row out of the JVM to a Python worker process and back, and each row then goes through up to 20k separate re.sub calls, which defeats much of the benefit of using Spark for this operation. As suggested in the comments, if you go with the regexp_replace() method instead, the work stays inside Spark's own execution engine on the distributed nodes, which should improve performance considerably.
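
On the point that the lines are not actually being replaced: Python's `re` module does not support POSIX bracket classes such as `[[:^alpha:]]`, so the pattern in the question most likely does not mean what was intended. A common alternative is to build one alternation pattern with word boundaries from the whole name list, so each string is scanned once instead of once per name. The sketch below demonstrates the pattern with plain `re`; the three-name list is a made-up sample, and the idea that the same pattern string could then be passed to PySpark's `regexp_replace` is a suggestion that has not been tested here against the 20k-name list.

```python
import re

# Made-up sample; the real list holds the 20k most frequent names.
name_list = ["John", "Emma", "Mary"]

# One pattern for all names: \b word boundaries stop 'John' matching inside 'Johnny',
# and re.escape guards against names containing regex metacharacters.
names_pattern = r"\b(?:" + "|".join(re.escape(name) for name in name_list) + r")\b"
compiled = re.compile(names_pattern)


def remove_names(text):
    # Replace every whole-word occurrence of a listed name in a single pass.
    return compiled.sub("[NAME]", text)


example = remove_names("John and Mary went to market")
untouched = remove_names("Johnny went to the park")
```

With the sample list, the sentence from the question becomes '[NAME] and [NAME] went to market', while 'Johnny' is left alone because of the word boundary.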
