Python - Pandas - Converting column with specific subsets into rows - python

I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where Serial becomes the column names and the data sorts itself into the correct Date row and Serial column.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop, but I'm just wondering if there is a more efficient way to do it?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x

You can use pandas.pivot_table:
import numpy as np

res = df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc=np.sum)\
        .reset_index()
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840
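Since each (Date, Serial) pair occurs only once in the sample data, a plain pivot without aggregation should also work; this is a minimal sketch, assuming the dataframe is named df as above:
# no aggregation needed because every Date/Serial combination is unique
res = df.pivot(index='Date', columns='Serial', values='Price').reset_index()
res.columns.name = ''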

Related

Pyspark: Reorder only a subset of rows among themselves

my data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I want to re-rank only the rows where reRnk == yes; the ranking is done based on "val".
I don't want to change the rows where reRnk = no. For example, at id=b we have reRnk=no, and I want to keep that row at row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, pyspark DFs do not have an index by default, so you might need to add one.
I do not know the exact syntax for pyspark, but since it has many similarities with pandas, this might lead you in a certain direction:
mask = df.reRnk == 'yes'
df.loc[mask, ['val', 'id']] = (
    df.loc[mask, ['val', 'id']]
      .sort_values('val', ascending=False)
      .set_index(df.loc[mask, ['val', 'id']].index)
)
Basically, what we do here is isolate the rows with reRnk == 'yes', sort these values, but reset the index to the original index. Then we assign these new values back to the original rows in the df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
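Here is the same idea in plain pandas as a runnable illustration (a minimal sketch built from the example data above; adapting it to pyspark.pandas may need changes):
import pandas as pd

df = pd.DataFrame({'val': [2, 1, 3, 8, 7, 9],
                   'id': list('abcdef'),
                   'reRnk': ['yes', 'no', 'no', 'yes', 'yes', 'no']})

mask = df.reRnk == 'yes'
# sort only the reRnk == 'yes' rows by val (descending) and write them back
# into the positions those rows originally occupied
df.loc[mask, ['val', 'id']] = (
    df.loc[mask, ['val', 'id']]
      .sort_values('val', ascending=False)
      .set_index(df.loc[mask, ['val', 'id']].index)
)
print(df)  # id order becomes d, b, c, e, a, f - matching the desired output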

Python, Pandas: shrinking a dataframe

Inside my application I have a dataframe that looks similar to this:
Example:
id | address | code_a | code_b | code_c | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | 1345wqdwqe | ....
2 | parkdrive 1 | 012ah8 | 012ah8a | dwqd4646 | ....
3 | parkdrive 2 | 852fhz | 852fhza | fewf6465 | ....
4 | parkdrive 3 | 456se1 | 456se1a | 856fewf13 | ....
5 | parkdrive 3 | 456se1 | 456se1a | gth8596s | ....
6 | parkdrive 3 | 456se1 | 456se1a | a48qsgg | ....
7 | parkdrive 4 | tg8596 | tg8596a | 134568a | ....
As you may see, every address can contain multiple entries inside my dataframe; code_a and code_b follow a certain pattern, and only code_c is unique.
What I'm trying to obtain is a dataframe where the column code_c is ignored or dropped and the whole dataframe is reduced to only one entry for each address...something like this:
id | address | code_a | code_b | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | ...
3 | parkdrive 2 | 852fhz | 852fhza | ...
4 | parkdrive 3 | 456se1 | 456se1a | ...
7 | parkdrive 4 | tg8596 | tg8596a | ...
I tried the groupby function, but this didn't seem to work - or is this even the right function?
Thanks for your help and a good day to all of you!
You can use drop_duplicates to do this:
df.drop_duplicates(subset=['address'], inplace=True)
This will keep only a single entry per address.
I think what you are looking for is
# in this way you are looking for all the duplicates rows in all columns except for 'code_c'
df.drop_duplicates(subset=df.columns.difference(['code_c']))
# in this way you are looking for all the duplicates rows ONLY based on column 'address'
df.drop_duplicates(subset='address')
I notice that in your example data, if you drop code_c, then all the entries with address "parkdrive 1", for example, are just duplicates.
You should drop the column code_c:
df.drop('code_c',axis=1,inplace=True)
Then you can drop the duplicates:
df_clean = df.drop_duplicates()
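Since groupby came up in the question: it can also be made to work here. A minimal sketch, assuming the dataframe is named df as in the example:
# drop the unique column, then keep the first row of each address group
df_clean = df.drop(columns='code_c').groupby('address', as_index=False).first()
drop_duplicates is the more direct tool, but groupby().first() achieves the same "one row per address" result.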

Where am I going wrong when analyzing this data?

I'm trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: 'Date'
Once you've used groupby(['Date', 'Activity']), Date and Activity have been transformed into indices and can't be referenced with sum_gen_ab['Date'].
To avoid transforming them into indices, you can use groupby(['Date', 'Activity'], as_index=False) instead.
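A minimal sketch of that fix and the plot, assuming att_df is the dataframe from the question:
import matplotlib.pyplot as plt

gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
# as_index=False keeps Date and Activity as regular columns after the groupby
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()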
I typically use the pandasql library to manipulate my data frames into different datasets. It allows you to manipulate your pandas data frame with SQL code, and pandasql can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = "will be your dataset"
new_dataset = psql.sqldf('''
    SELECT DATE, ACTIVITY, SUM(HOURS) as SUM_OF_HOURS
    FROM df
    GROUP BY DATE, ACTIVITY''')
new_dataset.head()  # shows the first 5 rows of your dataset

graphlab - sframe : How to remove rows which have same ids and condition on a column?

I have a graphlab SFrame where a few rows have the same id value in the "uid" column.
+-------------------+----------------------+----------------------+--------------+
| VIM Document Type | Vendor Number & Zone | Value <5000 or >5000 | Today Status |
+-------------------+----------------------+----------------------+--------------+
| PO_VR_GLB | 1613407EMEAi | Less than 5000 | 0 |
| PO_VR_GLB | 249737LATIN AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1216902NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1213709EMEAi | Less than 5000 | 0 |
| PO_MN_GLB | 882843NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
+-------------------+----------------------+----------------------+--------------+
+---------------------+
| uid |
+---------------------+
| 63068$#069 |
| 5789$#13 |
| 12933036$#IN6532618 |
| 12933022$#IN6590132 |
| 12932349$#IN6636468 |
| 12952077$#203250 |
| 13012770$#MUML04184 |
| 12945049$#112370 |
| 13582330$#CI160118 |
| 13012770$#MUML04184 |
+---------------------+
Here, I want to retain all the rows with unique uids, and just one of the rows that share the same uid; the row to be retained can be any row which has Today Status = 1. (There can be rows where the uid and Today Status are the same but other fields are different; in that case we can keep any one of those rows.) I want to do these operations on graphlab SFrames, but I am unable to figure out how to proceed.
You may use SFrame.unique(), which can give you unique rows:
sf = sf.unique()
Another way can be to use either the groupby() or join() methods, where you can specify the column name and work further from there. You can read the documentation on turi.com for the various options.
Another way (that I personally prefer) is to convert the SFrame to a pandas DataFrame, do the data operations there, and then convert the pandas DataFrame back to an SFrame. It depends on your choice, and I hope this helps.
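A minimal sketch of that pandas round-trip, assuming the SFrame is named sf and has the "uid" and "Today Status" columns shown above:
import graphlab as gl

pdf = sf.to_dataframe()
# put Today Status = 1 rows first, then keep only one row per uid
pdf = pdf.sort_values('Today Status', ascending=False).drop_duplicates(subset='uid')
sf_dedup = gl.SFrame(pdf)
Because drop_duplicates keeps the first occurrence, sorting by Today Status descending means a status-1 row is retained whenever one exists for a given uid.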

search for string in pandas row

How can I search through an entire row in a pandas dataframe for a phrase and, if it exists, create a new column that says 'Yes' and lists which columns in that row it was found in? I would like to be able to ignore case as well.
You could use pandas' apply function, which allows you to traverse rows or columns and apply your own function to them.
For example, given a dataframe
+--------------------------------------+------------+---+
| deviceid | devicetype | 1 |
+--------------------------------------+------------+---+
| b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 |
| 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 |
| cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 |
+--------------------------------------+------------+---+
Define a function
def pr(array, value):
    # index labels of the cells in this row that contain the phrase (case=False ignores case)
    condition = array[array.str.contains(value, case=False).fillna(False)].index.tolist()
    if condition:
        ret = array.append(pd.Series({"condition": ['Yes'] + condition}))
    else:
        ret = array.append(pd.Series({"condition": ['No'] + condition}))
    return ret
Use it
df.apply(pr, axis=1, args=('Google',))
+---+--------------------------------------+------------+---+-------------------+
| | deviceid | devicetype | 1 | condition |
+---+--------------------------------------+------------+---+-------------------+
| 0 | b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 | [Yes, devicetype] |
| 1 | 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 | [No] |
| 2 | cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 | [No] |
+---+--------------------------------------+------------+---+-------------------+
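If you only need the flag plus the matching column names, a shorter case-insensitive variant along these lines should also work (a sketch, assuming df is the frame above):
# boolean mask per cell, comparing everything as strings and ignoring case
hits = df.astype(str).apply(lambda col: col.str.contains('google', case=False))
df['condition'] = hits.apply(
    lambda row: ['Yes'] + row.index[row].tolist() if row.any() else ['No'], axis=1)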
