Efficiently plotting multiple columns in pandas - python

I would like to know how to efficiently plot groups of multiple columns in a pandas dataframe.
I have the following dataframe
GlobalID | a  | b  | c  |...| trial1.1 | trial1.2 |...| trial1.12 | trial2.1 |...| trial2.12 | trial3.1 |...| trial3.12 |
sd12f    |... |... |... |...| 210.1    | 213.1    |...| 170.1     | 176.2    |...| 160.31    | 162.4    |...| 186.1     |
...
I would like to loop through the rows and for each row plot three waveforms: trial1.[1-12], trial2.[1-12], trial3.[1-12]. What is the most efficient way to do this? Right now I have:
t1 = df.ix[0][df.columns[[colname.startswith('trial1') for colname in df]]]
t2 = df.ix[0][df.columns[[colname.startswith('trial2') for colname in df]]]
t3 = df.ix[0][df.columns[[colname.startswith('trial3') for colname in df]]]
t1.astype(float).plot()
t2.astype(float).plot()
t3.astype(float).plot()
I need the .astype(float) because the values are originally strings. Is there a more efficient way of doing this that I am missing? I am new to Python and pandas.

How about first transposing the dataframe, then splitting it by trial, then plotting?
# Transpose
data = pd.read_csv("data.txt").T
# Insert your code to remove irrelevant rows, like a, b, c in your example
#
# Group by the trial number (the first six characters) and plot
data.groupby(lambda x: x[:6], axis=0).plot()
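If you would rather stay row-wise and skip the transpose, here is a minimal sketch along the lines of the original code, using filter(like=...) to pick each trial's columns by prefix. It assumes matplotlib is the plotting backend and that the column names really do start with trial1/trial2/trial3:
import matplotlib.pyplot as plt

row = df.iloc[0]          # or loop over rows with df.iterrows()
for prefix in ('trial1', 'trial2', 'trial3'):
    # keep only the labels containing the prefix, then plot that waveform
    row.filter(like=prefix).astype(float).plot(label=prefix)
plt.legend()
plt.show()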

Related

How to use qcut in a dataframe with conditions value from columns

I have the following scenario in a sales dataframe, each row being a distinct sale:
| Category | Product | Purchase Value | new_column_A | new_column_B |
|----------|---------|----------------|--------------|--------------|
| A        | C       | 30             |              |              |
| B        | B       | 50             |              |              |
| C        | A       | 100            |              |              |
I've searched the qcut documentation but can't find how to add a series of columns based on the following logic (pseudocode):
when Category == 'A' and Product == 'A' then
    df['new_column_A'] = pd.qcut(df['Purchase_Value'], q=4)
when Category == 'A' and Product == 'B' then
    df['new_column_B'] = pd.qcut(df['Purchase_Value'], q=4)
Preferably I would like these new percentile-cut columns to be created in the same original dataframe.
The first thing that comes to mind is to split the dataframe into separate ones by doing the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
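One way that keeps everything in the original dataframe is boolean masking with .loc, so qcut is computed only on the matching rows. A minimal sketch, assuming the column names Category, Product and Purchase_Value from the example and the conditions exactly as written above:
import pandas as pd

mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')

# qcut is evaluated only on the selected rows; all other rows stay NaN in the new columns
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase_Value'], q=4)
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase_Value'], q=4)
Note that qcut needs enough distinct values in the selected rows to form four bins; passing duplicates='drop' relaxes that if a subset is small.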

Pyspark - how to merge transformed columns with an original DataFrame?

I created a function to test transformations on a DataFrame. This returns only the transformed columns.
def test_concat(df: sd.DataFrame, col_names: list) -> sd.DataFrame:
    return df.select(*[F.concat(df[column].cast(StringType()), F.lit(" new!")).alias(column) for column in col_names])
How can I replace the existing columns with the transformed ones in the original DF and return the whole DF?
Example DF:
test_df = self.spark.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
cols = ["metric"]
new_df = test_concat(test_df, cols)
new_df.show()
Expected result:
+-------------+-------+
|metric       | score |
+-------------+-------+
|metric1 new! | 10    |
|metric2 new! | 20    |
|metric3 new! | 30    |
+-------------+-------+
It looks like I can drop the original columns from the DF and then somehow append the transformed ones, but I'm not sure that's the right way to achieve this.
I can see you are only appending a keyword to the metric column; the same can be achieved using the built-in Spark functions as below.
withColumn behaves in two ways:
If the column is not present, it creates a new column.
If the column already exists, it replaces it with the result of the operation.
Logic to Concat
from pyspark.sql import functions as F
df = df.withColumn('metric', F.concat(F.col('metric'), F.lit(' '), F.lit('new!')))
df = df.select('metric', 'score')
df.show()
Output:
+-------------+-------+
|metric       | score |
+-------------+-------+
|metric1 new! | 10    |
|metric2 new! | 20    |
|metric3 new! | 30    |
+-------------+-------+
If you want to do it for many columns, you can make a foldLeft call.
#dsk has the right approach.
You probably want to avoid joins in this case, since there is no need to decouple the operation you are describing from the original dataframe (this is based on the examples you provided; if your real case has different needs, a different example may be warranted).
columnsToTransform.foldLeft(df)(
  (acc, next) => acc.withColumn(next, concat(col(next), lit("new !")))
)
Edit: I just realised that what I am proposing only works for Scala and that your snippet is in Python.
For Python something similar still works; instead of a fold you use a for loop:
# concat, col and lit come from pyspark.sql.functions
df = yourOriginalDf
for column_name in columnsToTransform:
    df = df.withColumn(column_name, concat(col(column_name), lit("new !")))
Create a new dataframe with updated column values and a monotonically increasing id
new_df = test_concat(test_df, cols).withColumn("index", F.monotonically_increasing_id())
Drop the list of columns from the first dataframe and add a monotonically increasing id
test_df_upt = test_df.drop(*cols).withColumn("index", F.monotonically_increasing_id())
Join the above two dataframes and drop the index column
test_df_upt.join(new_df, "index").drop("index").show()
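Another option that keeps the select-based style of the original helper, while avoiding both the join and the loop, is to rebuild the full column list in one select and transform only the requested columns. A minimal sketch (the function name concat_in_place is just illustrative, not from the question):
from pyspark.sql import functions as F

def concat_in_place(df, col_names):
    # keep every column; rewrite only the ones listed in col_names
    return df.select(*[
        F.concat(F.col(c).cast("string"), F.lit(" new!")).alias(c) if c in col_names else F.col(c)
        for c in df.columns
    ])

new_df = concat_in_place(test_df, ["metric"])
new_df.show()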

Run-time crashes when using CountVectorizer to create a term frequency dataframe from a column of lists

I am new to python and doing a data analysis project and would request some help.
I have a dataframe of 400,000 rows and it has columns: ID, Type 1 Categories, Type 2 Categories, Type 3 Categories, Amount, Age, Fraud.
The Category columns are columns of lists. Each list contains terms, and I want to build a matrix that counts how many times each term occurs in that row (one column per term, holding the frequency).
So the goal is to create a dataframe backed by a sparse matrix, with each unique category becoming a column. My dataset has over 2000 different categories, which may be why CountVectorizer is not a good fit here.
I tried two methods, one using CountVectorizer and another using for loops, but CountVectorizer crashes every time it runs and the second method is far too slow. I also split the dataframe into multiple chunks, and it still causes problems. Is there any way to improve these solutions?
Example:
+------+--------------------------------------------+---------+---------+
| ID | Type 1 Category | Amount | Fraud |
+------+--------------------------------------------+---------+---------+
| ID1 | [Lex1, Lex2, Lex1, Lex4, Lex2, Lex1] | 110.0 | 0 |
| ID2 | [Lex3, Lex6, Lex3, Lex6, Lex3, Lex1, Lex2] | 12.5 | 1 |
| ID3 | [Lex7, Lex3, Lex2, Lex3, Lex3] | 99.1 | 0 |
+------+--------------------------------------------+---------+---------+
col = 'Type 1 Category'
# prior to this, I combined the entire dataframe based on ID
# this was from old dataframe where each row had different occurrence of id
# and only one category per row
terms = df_old[col].unique()
countvec = CountVectorizer(vocabulary=terms)
# create bag of words
df = df.join(pd.DataFrame(countvec.fit_transform(df[col]).toarray(),
                          columns=countvec.get_feature_names(),
                          index=df.index))
# drop original column of lists
df = df.drop(col, axis = 1)
##### second method: split the dataframe into chunks using np.split
df_l3 = df_split[3]
output.index = df_l3.index
# Assign the columns.
output[['ID', col]] = df_l3[['ID', col]]
# 114305 is where the index of this chunk starts
last = 114305 + int(df_l3.shape[0])
for i in range(114305, last):
    print(i)
    for word in words:
        output.loc[i, str(word)] = output[col][i].count(str(word))
The CountVectorizer approach runs out of memory, and the second one no longer counts frequencies: it worked for chunk 1, where the index starts from zero, but not for the other chunks.
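One thing that may help, as a sketch rather than a tested fix: CountVectorizer expects strings by default, so columns of lists need an analyzer that treats each list as already tokenized, and keeping the result sparse instead of calling .toarray() avoids the memory blow-up. Method names such as get_feature_names_out assume a reasonably recent scikit-learn and pandas:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

col = 'Type 1 Category'
# a callable analyzer receives each raw document; returning the list as-is
# means every list element is counted as one term
countvec = CountVectorizer(analyzer=lambda doc: doc)
counts = countvec.fit_transform(df[col])          # scipy sparse matrix, never densified
term_df = pd.DataFrame.sparse.from_spmatrix(
    counts,
    columns=countvec.get_feature_names_out(),
    index=df.index,
)
df = df.drop(columns=col).join(term_df)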

Pandas table join and dedupe deciding which row to keep

I am trying to inner join and dedupe two tables, using a more complicated rule for deciding which rows to keep after deduping than "keep first" or "keep last".
Table A contains distinct IDs, and Age.
Table B contains multiple duplicated ID numbers, Ages, and data.
Only one row in Table B is correct so I want to keep only this row. The correct row is the one where the two Ages are most similar, but I also know that the correct Table B Ages are always lower than or equal to Table A Ages.
Table A
|ID |Age|
|----|---|
|1234| 45|
Table B
|ID |Age|data |
|----|---|-----|
|1234| 43|dataX|
|1234| 46|dataY|
|1234| 22|dataZ|
What I want is:
Joined Table
|ID |Age_A|Age_B|data |
|----|-----|-----|-----|
|1234| 45| 43|dataX|
How can I achieve this in Python Pandas?
We can use merge_asof and merge:
pd.merge_asof(df1,df2.sort_values(['Age']),on='Age',by='ID').merge(df2[['Age','data']],on='data')
Out[686]:
ID Age_x data Age_y
0 1234 45 dataX 43
Also, we can get rid of the second merge:
df2['Age_B']=df2.Age
pd.merge_asof(df1,df2.sort_values(['Age']),on='Age',by='ID')
Out[688]:
ID Age data Age_B
0 1234 45 dataX 43
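If the output should carry the Age_A / Age_B headers from the desired table, a small variation of the same merge_asof call works; the column renames here are just illustrative:
out = pd.merge_asof(
    df1.sort_values('Age').rename(columns={'Age': 'Age_A'}),
    df2.sort_values('Age').rename(columns={'Age': 'Age_B'}),
    left_on='Age_A', right_on='Age_B', by='ID',
)
# direction='backward' (the default) keeps the nearest Age_B that is <= Age_A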

Fill in missing boolean rows in Pandas

I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and then construct a new index from the Cartesian product of the unique values.
That would look like this:
import pandas

indexcols = ['ID', 'Boolean']
data = pandas.read_sql_query(querytext, engine)

full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

data = (
    data.set_index(indexcols)
        .reindex(full_index)
        .fillna(0)
        .reset_index()
)
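As a quick check on the one-ID example from the question (the frame is built by hand here in place of the SQL query):
data = pandas.DataFrame({'ID': [1234], 'Boolean': [0], 'Count': [10]})
full_index = pandas.MultiIndex.from_product([data['ID'].unique(), [0, 1]], names=['ID', 'Boolean'])
print(data.set_index(['ID', 'Boolean']).reindex(full_index).fillna(0).reset_index())
#      ID  Boolean  Count
# 0  1234        0   10.0
# 1  1234        1    0.0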
