I'm trying to find a trend in attendance. I filtered my existing df down to the following so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
This returns KeyError: 'Date'.
Once you've used groupby(['Date', 'Activity']), Date and Activity become index levels and can no longer be referenced as columns with sum_gen_ab['Date'].
To avoid turning them into index levels, use groupby(['Date', 'Activity'], as_index=False) instead (or call .reset_index() on the result).
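A minimal sketch of the full flow with that fix, assuming att_df has the Date, Activity and Hours columns from the sample (converting Date to datetime is an extra step, added here only so the x-axis sorts chronologically):
import pandas as pd
import matplotlib.pyplot as plt

gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"].copy()
gen_ab['Date'] = pd.to_datetime(gen_ab['Date'])  # so the dates plot in order

# Keep Date/Activity as ordinary columns instead of index levels
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()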
I typically use the pandasql library to reshape my data frames into different datasets. It lets you manipulate a pandas data frame with SQL, and it can be used alongside regular pandas code.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = ...  # your existing attendance data frame

new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS Sum_Of_Hours
    FROM df
    GROUP BY Date, Activity''')

new_dataset.head()  # shows the first 5 rows of your dataset
I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign values to them based on the string values in other columns, matching them either with a regex or an exact character match.
For example, say there is a column called File path. I have to create a feature column whose values are assigned by taking a folder path (full or partial), matching it against the file path, and updating the feature column accordingly.
I thought about iterating with a for loop, but it is very time-consuming, and I think iterating in pandas would take even longer if the number of components grows in the future.
Is there an efficient way to do this type of operation in pandas? Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get components in other tables, for example:
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
|          | */IDH_Toet/*.c |
|          | */SFD_Toto/*.c |
and a second as:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
|          | */SFD_Thun/*.c |
|          | */SFD_Toto/*.c |
etc. In total there will be around 1,000,000 files and 278 components.
I want the result as:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - the file may belong to more than one component, or we can give a list of the components it belongs to.
I was able to read the component tables and create the regexes, but to create the Component column I would have to write for loops over all 278 components and loop over the same table for each one.
Is there an easier way to do this with pandas, since the data will be very large?
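One hedged sketch of a vectorised approach, assuming the big table has the ID and File columns from the example and that each component's path patterns can be reduced to the folder names to match (component_patterns below is a hypothetical stand-in for the 278 component tables): build one combined regex per component, compute a boolean mask per component with str.contains, and decide Other/Various from the number of matches. The only Python loop is over the components, never over the million rows.
import re
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 9),
    'File': ['SWE_Toot', 'SWE_Thun', 'IDH_Toet', 'SDF_Then',
             'SWE_Toot', 'SWE_Thun', 'SEH_Toot', 'SFD_Thun'],
})

# Hypothetical mapping: component name -> folder names extracted from its patterns
component_patterns = {
    'Software': ['SWE_Toot', 'IDH_Toet', 'SFD_Toto'],
    'Wire': ['SDF_Then', 'SFD_Thun', 'SFD_Toto'],
}

# One vectorised str.contains per component (a loop over 278 components,
# not over 1,000,000 rows)
masks = pd.DataFrame({
    comp: df['File'].str.contains('|'.join(map(re.escape, pats)), case=False)
    for comp, pats in component_patterns.items()
})

match_count = masks.sum(axis=1)
df['Component'] = masks.idxmax(axis=1).where(match_count == 1)  # exactly one match
df.loc[match_count > 1, 'Component'] = 'Various'                # more than one match
df.loc[match_count == 0, 'Component'] = 'Other'                 # no match at all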
I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where each Serial value becomes a column name and the data arranges itself into the correct Date row and Serial column.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop, but I'm just wondering if there is a more efficient way to do it?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x
You can use pandas.pivot_table:
import numpy as np

res = (df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc=np.sum)
         .reset_index())
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840
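If each (Date, Serial) pair occurs only once, as in the sample data, a plain pivot with no aggregation also works; a minimal sketch:
res = df.pivot(index='Date', columns='Serial', values='Price').reset_index()
res.columns.name = ''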
I have a GraphLab SFrame where a few rows have the same id value in the "uid" column.
+-------------------+----------------------+----------------------+--------------+
| VIM Document Type | Vendor Number & Zone | Value <5000 or >5000 | Today Status |
+-------------------+----------------------+----------------------+--------------+
| PO_VR_GLB | 1613407EMEAi | Less than 5000 | 0 |
| PO_VR_GLB | 249737LATIN AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1216902NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1213709EMEAi | Less than 5000 | 0 |
| PO_MN_GLB | 882843NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
+-------------------+----------------------+----------------------+--------------+
+---------------------+
| uid |
+---------------------+
| 63068$#069 |
| 5789$#13 |
| 12933036$#IN6532618 |
| 12933022$#IN6590132 |
| 12932349$#IN6636468 |
| 12952077$#203250 |
| 13012770$#MUML04184 |
| 12945049$#112370 |
| 13582330$#CI160118 |
| 13012770$#MUML04184 |
+---------------------+
Here, I want to retain all the rows with unique uids, and for rows that share the same uid, keep only one of them; the retained row can be any row that has Today Status = 1 (i.e. there can be rows where the uid and status are the same but other fields differ, and in that case we can keep any one of them). I want to do these operations in GraphLab SFrames, but I am unable to figure out how to proceed.
You may use SFrame.unique(), which gives you unique rows:
sf = sf.unique()
Another way is to use either the groupby() or join() methods, where you can specify the column names and work from there. You can read their documentation on turi.com for the various options.
Another way (that I personally prefer) is to convert the SFrame to a pandas DataFrame, do the data operations there, and then convert the pandas DataFrame back to an SFrame. It depends on your choice, and I hope this helps.
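A minimal sketch of that pandas round trip, assuming the SFrame is called sf and the status column is named 'Today Status' as in the sample: sort so that rows with Today Status = 1 come first within each uid, then keep the first row per uid.
import graphlab as gl  # or the equivalent turicreate import, depending on your install

df = sf.to_dataframe()  # SFrame -> pandas DataFrame

# Put status = 1 rows first within each uid, then keep one row per uid
df = (df.sort_values(['uid', 'Today Status'], ascending=[True, False])
        .drop_duplicates(subset='uid', keep='first'))

sf = gl.SFrame(df)  # back to an SFrame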
I have an SQL table with the data below:
select day, counter from table1 where day LIKE '%201601%';
+----------+---------+
| day | counter |
+----------+---------+
| 20160101 | 125777 |
| 20160102 | 31720 |
| 20160105 | 24981 |
| 20160106 | 240366 |
| 20160107 | 270560 |
| 20160108 | 268788 |
| 20160109 | 254286 |
| 20160110 | 218154 |
| 20160111 | 250186 |
| 20160112 | 94532 |
| 20160113 | 71437 |
| 20160114 | 71121 |
| 20160115 | 71135 |
| 20160116 | 71325 |
| 20160117 | 209762 |
| 20160118 | 210305 |
| 20160119 | 257627 |
| 20160120 | 306353 |
| 20160121 | 214687 |
| 20160122 | 214680 |
| 20160123 | 149844 |
| 20160124 | 133741 |
| 20160125 | 82404 |
| 20160126 | 71403 |
| 20160127 | 71437 |
| 20160128 | 72005 |
| 20160129 | 71417 |
| 20160130 | 0 |
| 20160131 | 69937 |
+----------+---------+
I have a Python script which I run with:
python myapp.py January
The January variable contains the query below:
January = """select day, counter from table1 where day LIKE '%201601%';"""
What I would like to do is run the script with different flags and have the script calculate the sum for all of the days of:
last month,
this month,
last week,
last two weeks
a specific month.
At the moment I have different variables for each month of 2016; this way the script will become huge, and I am sure there is an easier way to do this.
cursor = db.cursor()
cursor.execute(eval(sys.argv[1]))
display = cursor.fetchall()
MonthlyLogs = sum(int(row[1]) for row in display)
The point of this is that I want to see discrepancies in the data. My overall aim is to display this data in PHP at a later date, but for now I would like it to be written to a file, which it currently is.
What is the best way to achieve this?
Thanks,
Incubator
For the first part, you can get the current month and year using datetime.
from datetime import datetime

currentMonth = datetime.now().month
currentYear = datetime.now().year

# Zero-pad the month so January gives '201601' rather than '20161'
parameter = '{:04d}{:02d}'.format(currentYear, currentMonth)
Passing this parameter into your query should work.
For the second part, you may define a function with a flag: when it is set to 1, build the parameter as above; otherwise take the parameter's value from sys.argv[1].
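A minimal sketch of that idea, using the table and column names from the question and the existing db connection; build_query and use_current_month are hypothetical names, and the query string is built from the computed parameter instead of keeping one hard-coded variable per month:
import sys
from datetime import datetime

def build_query(use_current_month=True):
    if use_current_month:
        now = datetime.now()
        parameter = '{:04d}{:02d}'.format(now.year, now.month)
    else:
        parameter = sys.argv[1]  # e.g. '201601'
    return ("select day, counter from table1 "
            "where day LIKE '%{}%';".format(parameter))

cursor = db.cursor()  # db is the existing connection from your script
cursor.execute(build_query(use_current_month=False))
MonthlyLogs = sum(int(row[1]) for row in cursor.fetchall())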
How can I search through an entire row in a pandas dataframe for a phrase and, if it exists, create a new column that says 'Yes' and lists which columns in that row it was found in? I would like to be able to ignore case as well.
You could use pandas' apply function, which allows you to traverse rows or columns and apply your own function to them.
For example, given a dataframe
+--------------------------------------+------------+---+
| deviceid | devicetype | 1 |
+--------------------------------------+------------+---+
| b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 |
| 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 |
| cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 |
+--------------------------------------+------------+---+
Define a function
import pandas as pd

def pr(array, value):
    # Columns in this row whose value contains the phrase (case-insensitive,
    # as the question asks); non-string cells give NaN, treated as no match
    condition = array[array.str.contains(value, case=False).fillna(False)].index.tolist()
    if condition:
        ret = array.append(pd.Series({"condition": ['Yes'] + condition}))
    else:
        ret = array.append(pd.Series({"condition": ['No']}))
    return ret
Use it
df.apply(pr, axis=1, args=('Google',))
+---+--------------------------------------+------------+---+-------------------+
| | deviceid | devicetype | 1 | condition |
+---+--------------------------------------+------------+---+-------------------+
| 0 | b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 | [Yes, devicetype] |
| 1 | 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 | [No] |
| 2 | cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 | [No] |
+---+--------------------------------------+------------+---+-------------------+
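A hedged alternative sketch that avoids the per-row string matching in Python: check every column at once with vectorised str.contains (the phrase and column names are taken from the example above, and the new column names are illustrative):
hits = df.apply(lambda col: col.astype(str).str.contains('Google', case=False))

df['condition'] = hits.any(axis=1).map({True: 'Yes', False: 'No'})
df['matched_columns'] = hits.apply(lambda row: row.index[row].tolist(), axis=1)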