I'm a little confused about how the Pandas DataFrame works.
I have a pd.DataFrame that looks like this:
index               | val1 | val2
---------------------------------
20-11-2017 22:33:20 | 0.33 | 05.43
23-11-2017 23:34:14 | 4.23 | 09.43
I'd like to append a row to it, and be able to specify the index, which in my case is a date and time.
I have tried the following methods:
dataframe = pd.DataFrame(columns=['val1', 'val2'])
dataframe.loc[someDate] = [someVal, someVal]
This seems to overwrite if the index already exists, but I want to be able to have duplicate indices.
dataframe = pd.DataFrame(columns=['val1', 'val2'])
record = pd.Series(
index=[someDate],
data=[someVal, someVal]
)
dataframe.append(record)
This causes the application to hang without returning an exception or error.
Am I missing something? Is this the correct way of doing the thing I want to achieve?
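For what it's worth, one way to keep duplicate index labels is to build a one-row DataFrame and concatenate it in with pd.concat (DataFrame.append has since been deprecated and removed in pandas 2.0). A minimal sketch; add_row is just an illustrative helper and the second row's values are made up:

import pandas as pd

dataframe = pd.DataFrame(columns=['val1', 'val2'])

def add_row(df, when, val1, val2):
    # A one-row frame with the timestamp as its index; pd.concat keeps
    # duplicate index labels instead of overwriting them like .loc does.
    row = pd.DataFrame({'val1': [val1], 'val2': [val2]}, index=[when])
    return pd.concat([df, row])

dataframe = add_row(dataframe, pd.Timestamp('2017-11-20 22:33:20'), 0.33, 5.43)
dataframe = add_row(dataframe, pd.Timestamp('2017-11-20 22:33:20'), 1.11, 2.22)  # same timestamp, both rows kept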
Related
I created a function to test transformations on a DataFrame. This returns only the transformed columns.
def test_concat(df: sd.DataFrame, col_names: list) -> sd.DataFrame:
return df.select(*[F.concat(df[column].cast(StringType()), F.lit(" new!")).alias(column) for column in col_names])
How can I replace the existing columns with the transformed ones in the original DF and return the whole DF?
Example DF:
test_df = self.spark.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
cols = ["metric"]
new_df = test_concat(test_df, cols)
new_df.show()
Expected result:
+-------------+-------+
|metric       | score |
+-------------+-------+
|metric1 new! | 10    |
|metric2 new! | 20    |
|metric3 new! | 30    |
+-------------+-------+
It looks like I can drop the original columns from the DF and then somehow append the transformed ones, but I'm not sure that's the right way to achieve this.
I can see you are only appending a keyword to the metric column; the same can be achieved with the built-in Spark functions as below.
withColumn has two behaviours:
If the column is not present, it will create a new column.
If the column is already there, it will perform the operation on that same column.
Logic to concat:
from pyspark.sql import functions as F
df = df.withColumn('metric', F.concat(F.col('metric'), F.lit(' '), F.lit('new!')))
df = df.select('metric', 'score')
df.show()
Output:
+-------------+-------+
|metric       | score |
+-------------+-------+
|metric1 new! | 10    |
|metric2 new! | 20    |
|metric3 new! | 30    |
+-------------+-------+
If you want to do it for many columns, you would make a foldLeft call.
@dsk has the right approach.
You probably want to avoid joins in this case, since there is no need to decouple the operation you are describing from the original dataframe (this is based on the examples you provided; if your real case has different needs, a different example may be needed).
columnsToTransform.foldLeft(df)(
  (acc, next) => acc.withColumn(next, concat(col(next), lit(" new!")))
)
Edit: I just realised that what I am proposing only works for Scala and that your snippet is in Python.
For Python something similar will still work; instead of a fold you use a for loop:
df = yourOriginalDf
for column in columnsToTransform:
    df = df.withColumn(column, concat(col(column), lit(" new!")))
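Spelled out against the example DataFrame from the question, a self-contained version might look like the sketch below (assuming test_df and a SparkSession already exist):

from pyspark.sql import functions as F

columns_to_transform = ["metric"]
df = test_df  # the example DataFrame from the question
for column in columns_to_transform:
    df = df.withColumn(column, F.concat(F.col(column), F.lit(" new!")))

df.select("metric", "score").show()  # rows now read "metric1 new!", "metric2 new!", ... as in the expected result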
Create a new dataframe with updated column values and a monotonically increasing id
new_df = test_concat(test_df, cols).withColumn("index", F.monotonically_increasing_id())
Drop the list of columns from the first dataframe and add a monotonically increasing id
test_df_upt = test_df.drop(*cols).withColumn("index", F.monotonically_increasing_id())
Join the above 2 dataframes and drop the index column
test_df_upt.join(new_df, "index").drop("index").show()
I want to find all rows where a certain value is present inside the column's list value.
So imagine I have a dataframe set up like this:
| placeID | users |
------------------------------------------------
| 134986| [U1030, U1017, U1123, U1044...] |
| 133986| [U1034, U1011, U1133, U1044...] |
| 134886| [U1031, U1015, U1133, U1044...] |
| 134976| [U1130, U1016, U1133, U1044...] |
How can I get all rows where 'U1030' exists in the users column?
Or... is the real problem that I should not have my data arranged like this, and I should instead explode that column to have a row for each user?
What's the right way to approach this?
The way you have stored the data looks fine to me; you do not need to change the format.
Try this:
df1 = df[df['users'].str.contains("U1030")]
print(df1)
This will give you all the rows containing the specified user, as a DataFrame.
When you want to check whether a value exists inside a column whose values are lists, it's helpful to use the map function.
Implemented as below with an inline lambda, each list stored in the 'users' column is bound to the parameter u, and userID is compared against it.
Really the answer is pretty straightforward when you look at the code below:
# user_filter filters the dataframe to all the rows where
# 'userID' is NOT in the 'users' column (the value of which
# is a list type)
user_filter = df['users'].map(lambda u: userID not in u)
# cuisine_filter filters the dataframe to only the rows
# where 'cuisine' exists in the 'cuisines' column (the value
# of which is a list type)
cuisine_filter = df['cuisines'].map(lambda c: cuisine in c)
# Display the rows that satisfy both filters
df[user_filter & cuisine_filter]
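For the original question (rows where 'U1030' is present), the same map idea works with the condition flipped; a small sketch, where target_user is just an illustrative name:

target_user = 'U1030'
# Keep the rows whose 'users' list contains the target user
present = df['users'].map(lambda users: target_user in users)
df[present]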
The following pyspark command
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
created the following result.
+---------+---------+
|URL_short|NumOfReqs|
+---------+---------+
|http1    |      500|
|http4    |      500|
|http2    |      500|
|http3    |      500|
+---------+---------+
In the original DataFrame dataFrame I have a column named success whose type is text. The value can be "true" or "false".
In the result I would like to have an additional column named for example NumOfSuccess which counts the elements having entry "true" in the original column success per category URL_short.
How can I modify
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
to also output the column that counts the rows satisfying the condition success == "true" per URL_short category?
One way to do it is to add another aggregation expression (also turn the count into an agg expression):
import pyspark.sql.functions as f
dataFrame.groupBy("URL_short").agg(
f.count('*').alias('NumOfReqs'),
f.sum(f.when(f.col('success'), 1).otherwise(0)).alias('CountOfSuccess')
).show()
Note this assumes your success column is of boolean type; if it's a string, change the expression to f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('CountOfSuccess').
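Put together for a string-typed success column, and using the NumOfSuccess name from the question, the whole statement might look like this sketch (assuming the column holds the literal strings 'true'/'false'):

import pyspark.sql.functions as f

result = dataFrame.groupBy("URL_short").agg(
    f.count('*').alias('NumOfReqs'),
    f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('NumOfSuccess')
)
result.show()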
I have dataframe rounds (which was the result of deleting a column from another dataframe) with the following structure (can't post pics, sorry):
----------------------------
|type|N|D|NATC|K|iters|time|
----------------------------
rows of data
----------------------------
I use groupby so I can then get the mean of the groups, like so:
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean()
I get the means that I wanted but I get a problem with the keys. The results_mean dataframe has the following structure:
----------------------------
| | | | | | |time|
|type|N|D|NATC|K|iters| |
----------------------------
rows of data
----------------------------
The only key recognized is time (I executed results_mean.keys()).
What did I do wrong? How can I fix it?
In your aggregated data, time is the only column; the other ones make up the index.
groupby has a parameter as_index. From the documentation:
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
So you can get the desired output by calling
rounds = results.groupby(['type','N','D','NATC','K','iters'], as_index = False)
results_mean = rounds.mean()
Or, if you want, you can always convert indices to keys by using reset_index. Using
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean().reset_index()
should have the desired effect as well.
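As a quick illustration with made-up values (the column names follow the question; the numbers are invented):

import pandas as pd

results = pd.DataFrame({
    'type': ['a', 'a', 'b'],
    'N': [1, 1, 2], 'D': [1, 1, 1], 'NATC': [0, 0, 0],
    'K': [3, 3, 3], 'iters': [10, 10, 10],
    'time': [1.0, 2.0, 3.0],
})

# Default: the grouping columns become the index, so only 'time' shows up in .keys()
print(results.groupby(['type', 'N', 'D', 'NATC', 'K', 'iters']).mean().keys())

# With as_index=False (or .reset_index()), they stay as regular columns
print(results.groupby(['type', 'N', 'D', 'NATC', 'K', 'iters'], as_index=False).mean().keys())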
I had the same problem of losing the DataFrame's keys after using groupby(); the workaround I found was to write the DataFrame out to a CSV file and then read it back in.
I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and then construct a new index from the Cartesian product of the unique values.
That would look like this:
import pandas
indexcols = ['ID', 'Boolean']
data = pandas.read_sql_query(querytext, engine)
full_index = pandas.MultiIndex.from_product(
[data['ID'].unique(), [0, 1]],
names=indexcols
)
data = (
    data.set_index(indexcols)
        .reindex(full_index)   # introduces rows for the missing (ID, Boolean) pairs
        .fillna(0)             # their Count becomes 0
        .reset_index()
)
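One small follow-up: fillna(0) leaves the count column as float, since reindex introduced NaN for the missing combinations. If integer counts are needed downstream, it can be cast back afterwards, e.g. data['Count'] = data['Count'].astype(int) (assuming the count column is literally named Count, as in the question).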