I have the following columns in a DataFrame:
| invoice_number | client | tax_rate_1_isp | tax_base_1_isp | tax_1_isp | tax_rate_2_isp | tax_base_2_isp | tax_2_isp | tax_rate_1_no_isp | tax_base_1_no_isp | tax_1_no_isp | tax_rate_2_no_isp | tax_base_2_no_isp | tax_2_no_isp | status |
|----------------|---------|----------------|----------------|-----------|----------------|----------------|-----------|-------------------|-------------------|--------------|-------------------|-------------------|--------------|---------|
| #1 | client1 | 15% | 100 | 15 | | | | 0% | 100 | 0 | 10% | 200 | 20 | correct |
| #2 | client2 | 0% | 300 | 0 | | | | 10% | 100 | 10 | | | | correct |
And I would like to reorganize the DataFrame so it looks like this:
| invoice_number | client | tax_type | tax_rate | tax_base | tax | status |
|----------------|---------|----------|----------|----------|-----|---------|
| #1 | client1 | isp | 15% | 100 | 15 | correct |
| #1 | client1 | no_isp | 0% | 100 | 0 | correct |
| #1 | client1 | no_isp | 10% | 200 | 20 | correct |
| #2 | client2 | isp | 0% | 300 | 0 | correct |
| #2 | client2 | no_isp | 10% | 100 | 10 | correct |
where a new row is created for each group of tax_rate, tax_base and tax, keeping the same information for the rest of the columns and adding a new column that specifies which tax type (isp or no_isp) the group corresponds to, as identified in the column names of the first DataFrame.
The goal is to eventually be able to create a pivot table from the data.
Is there an efficient way to do that?
What I am doing now, and it is a pain, is creating separate DataFrames that select the columns belonging to the same tax group, filtering them to keep only the rows with data, and appending those to a DataFrame with the structure I need.
What I shared is an example; the actual data could easily have more than 50 tax groups...
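One option that avoids building the per-group DataFrames by hand is pd.wide_to_long. A minimal sketch, assuming the frame is called df, the columns follow the tax_rate_<n>_<type> / tax_base_<n>_<type> / tax_<n>_<type> naming shown above, and empty cells were read in as NaN:

import pandas as pd

# reshape the repeated tax_* column groups into rows; the suffix ('1_isp',
# '2_no_isp', ...) ends up in a temporary 'tax_group' column
long = pd.wide_to_long(
    df,
    stubnames=['tax_rate', 'tax_base', 'tax'],
    i=['invoice_number', 'client', 'status'],
    j='tax_group',
    sep='_',
    suffix=r'\d+_(?:isp|no_isp)',
).reset_index()

# '1_isp' -> 'isp', '2_no_isp' -> 'no_isp'
long['tax_type'] = long['tax_group'].str.split('_', n=1).str[1]

# drop the empty tax groups (assumes missing cells are NaN) and the helper column
long = long.dropna(subset=['tax_base']).drop(columns='tax_group')

The resulting long frame can then be fed straight into pivot_table.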
I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign them values based on the string values in the other columns, matching either with a regex or an exact character match.
For example, if there is a column called File path, I have to create a feature column whose values are assigned from the folder path (full or partial) by matching it against the file path and updating the feature column.
I thought about iterating with a for loop, but it takes a lot of time, and with pandas I think iterating would take even more time if the number of components grows in the future.
Is there an efficient way to do this type of operation in pandas?
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I receive the components in other tables, like this:
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
|  | */IDH_Toet/*.c |
|  | */SFD_Toto/*.c |
and a second one:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
|  | */SFD_Thun/*.c |
|  | */SFD_Toto/*.c |
etc.; there will be around 1,000,000 files and 278 components in total.
I want the result to be:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - the file may belong to more than one component (or we can give a list of the components it belongs to).
I was able to read the component tables and build a regex, but to create the Component column I would have to write for loops over all 278 components and loop over the main table for each one.
Is there an easier way to do this with pandas?
The data will be very large.
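For what it's worth, here is a rough sketch of a vectorized approach, under a few assumptions: the component tables have been stacked into one frame called components with the ID and File columns shown above, the folder token (e.g. SWE_Toot) is what appears in the main table's File column, and df is the main frame from the first table. The classify helper and all the names are made up for illustration:

import pandas as pd

# assumed shape of the stacked component tables (illustrative data only)
components = pd.DataFrame({
    'ID':   ['Software', 'Software', 'Software', 'Wire', 'Wire', 'Wire'],
    'File': ['*/SWE_Toot/*.h', '*/IDH_Toet/*.c', '*/SFD_Toto/*.c',
             '*/SDF_Then/*.h', '*/SFD_Thun/*.c', '*/SFD_Toto/*.c'],
})

# pull the folder token out of each pattern, e.g. '*/SWE_Toot/*.h' -> 'SWE_Toot'
components['token'] = components['File'].str.extract(r'\*/([^/]+)/', expand=False)

# one lookup entry per token, listing every component that claims it
lookup = components.groupby('token')['ID'].agg(list)

def classify(matches):
    # hypothetical helper: collapse the list of matching components into a label
    if not matches:
        return 'Other'
    return matches[0] if len(matches) == 1 else 'Various'

# map each file to the components that claim it, then collapse to one label;
# this replaces the nested for loops with one map over the 1,000,000 rows
df['Component'] = (df['File'].map(lookup)
                             .apply(lambda m: classify(m if isinstance(m, list) else [])))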
The dataset contains data about COVID-19 patients. It is in both Excel and CSV format and contains several variables and over 7 thousand records (rows), which makes the problem extremely hard and time consuming to solve manually. Below are the 4 most important variables (columns) needed to solve the problem: 1) id, identifying each record (row); 2) day_at_hosp, one entry for each day a patient remained admitted at the hospital; 3) sex of the patient; 4) death, whether the patient eventually died or survived.
I want to create a new variable total_days_at_hosp which should contain the total number of days a patient remained admitted at the hospital.
Old Table:
_______________________________________
| id | day_at_hosp | sex | death |
|_______|_____________|________|________|
| 1 | 0 | male | no |
| 2 | 1 | | |
| 3 | 2 | | |
| 4 | 0 | female | no |
| 5 | 1 | | |
| 6 | 0 | male | no |
| 7 | 0 | female | no |
| 8 | 0 | male | no |
| 9 | 1 | | |
| 10 | 2 | | |
| 11 | 3 | | |
| 12 | 4 | | |
| ... | ... | ... | ... |
| 7882 | 0 | female | no |
| 7883 | 1 | | |
|_______|_____________|________|________|
New Table:
I want to convert table above into table below:
____________________________________________
| id |total_days_at_hosp| sex | death |
|_______|__________________|________|________|
| 1 | 3 | male | no |
| 4 | 2 | male | yes |
| 6 | 1 | male | yes |
| 7 | 1 | female | no |
| 8 | 5 | male | no |
| ... | ... | ... | ... |
| 2565 | 2 | female | no |
|_______|__________________|________|________|
NOTE: the id column is for every record entered, and multiple records were entered for each patient depending on how long the patient remained admitted at the hospital. The day_at_hosp variable contains days: 0 = initial day at hospital, 1 = second day at hospital, ..., n = last day at hospital.
The record (row) where day_at_hosp is 0 has entries in all the other columns; if day_at_hosp is not 0, say 1, 2, 3, ..., then the row belongs to the patient right above, and all the other variables (columns) are left blank.
However, the dataset I need should look like the table below.
It should include a new variable (column) called total_days_at_hosp generated from day_at_hosp. The new total_days_at_hosp column is more useful for the statistical tests to be conducted and will replace day_at_hosp, so that all blank rows can be deleted.
To move from the old table to the new table, the program should do the following:
day_at_hosp ===> total_days_at_hosp
0
1 ---> 3
2
-------------------------------------
0 ---> 2
1
-------------------------------------
0 ---> 1
-------------------------------------
0 ---> 1
-------------------------------------
0
1
2 ---> 5
3
4
-------------------------------------
...
-------------------------------------
0 ---> 2
1
-------------------------------------
How can I achieve this?
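For reference, if the data is loaded into pandas, one rough sketch of this reshaping (assuming the CSV has been read into a DataFrame df with exactly the columns above and that the blank cells came in as NaN) is:

import pandas as pd

# each admission starts at day_at_hosp == 0, so a running count of those
# zeros labels the rows that belong to the same patient
df['patient'] = (df['day_at_hosp'] == 0).cumsum()

new = (df.groupby('patient', as_index=False)
         .agg(id=('id', 'first'),                         # id of the admission row
              total_days_at_hosp=('day_at_hosp', 'size'), # rows per patient = days admitted
              sex=('sex', 'first'),
              death=('death', 'first'))
         .drop(columns='patient'))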
Another formula option, without a dummy value placed at the end of the Old/New Table.
1] Create the New Table by >>
Copy and paste all the Old Table data to an unused area
Click "AutoFilter"
In the "days_at_hospital" column select the value =0
Copy and paste the filtered admissions to New Table column F
Delete all 0s in the rows of Column G
Then,
2] In G2, formula copied down :
=IF(F2="","",IF(F3="",MATCH(9^9,A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
Remark: if your ID column contains text values, the formula changes to:
=IF(F2="","",IF(F3="",MATCH("zzz",A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
It is apparent that your data are sorted by patient and that your desired table will be much 'shorter'. Accordingly, the starting point for this answer is to apply an AutoFilter to your original data, setting the filter criterion to days_at_hospital = 0, and then copy this filter of admissions to column F:
After deleting the old column G data, the formula below can then be entered in cell G2 and copied down:
=INDEX(B:B,MATCH(F3,A:A,0)-1)+1
To keep the formula simple, the same dummy maximum value should be entered at the end of both the old and new tables.
I am trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been turned into index levels and can no longer be referenced with sum_gen_ab['Date'].
To avoid turning them into index levels, you can use groupby(['Date', 'Activity'], as_index=False) instead.
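For example (a minimal sketch, reusing gen_ab from the question and assuming matplotlib is imported as plt):

sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

# 'Date' and 'Hours' are now ordinary columns, so the plot call works
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()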
I typically use the pandasql library to manipulate my data frames into different datasets. It lets you manipulate a pandas data frame with SQL code, and it can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = ...  # your dataset

new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS Hours
    FROM df
    GROUP BY Date, Activity''', locals())

new_dataset.head()  # shows the first 5 rows of your dataset
I have two pandas datasets
old:
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------ | ------ | ---- | ---- |
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
new:
| alpha | beta | zeta | id | numb |
| ------ | ------------------ | ---------------| ------| -----|
| 1 | LA | Hwood | Q | Q400 |
| 2 | NY | queens | B | B200 |
| 3 | Chic | lincpark | D | D300 |
(The columns and data don't mean anything in particular; this is just an example.)
I want to merge the datasets in such a way that:
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.id = new.numb, keep only the entry from the old table (in this case the queens row of old would be kept rather than the queens row of new).
Note that rows 3 and 4 of old are identical, but we still keep both. If there were 2 duplicates of these rows in new, we consider them as corresponding 1-to-1. If there were 3 duplicates in new of rows 3 and 4 of old, then 2 would be considered copies (and not added), but the third would be added when we merge.
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.numb is contained inside new.numb, keep only the entry from the old table (in this case the lincpark row of old would be kept rather than the lincpark row of new, because 300 is contained in new.numb).
Otherwise, add the new data as new rows, keeping the new table's id and numb and leaving null any extra columns that the old table has (new's row 1 with Hwood).
I have tried various merging methods along with the drop_duplicates method. The problem with the latter is that I attempted to drop duplicates having the same alpha, beta and zeta, but they were often deleted from the same data source because the rows were exactly the same.
This is what ultimately needs to be shown after merging: 2 of the rows in new were duplicates, and one was something to be added.
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------ | ------ | ---- | ---- |
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
| 1 | LA | Hwood | Q | | Q400|
We can merge two data frames in several ways; the most common way in Python is the merge operation in pandas.
Assuming df1 is your new and df2 is your old:
Follow the merge with your IF conditions.
import pandas as pd
dfinal = df1.merge(df2, on="alpha", how='inner')
For merging on columns of different dataframes, you can specify the left and right column names explicitly, especially when the same column has two different names, say 'idold' and 'idnew':
dfinal = df1.merge(df2, how='inner', left_on='alpha', right_on='id')
If you want to be even more specific, you can read the documentation of the pandas merge operation.
You can also express the IF conditions as row-wise merge operations, drop the unneeded columns in a temporary dataframe, and then add values to that dataframe according to the conditions.
I understand the answer is a little bit complex, but so is your question. Cheers :)
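As a very rough sketch of the kind of merge described in the question (covering only the simple equality condition old.id = new.numb on top of matching alpha/beta/zeta; the 'contained in' rule and the duplicate counting would need extra handling), an indicator merge could look like this:

import pandas as pd

# treat a row of `new` as already present in `old` when alpha/beta/zeta match
# and old.id equals new.numb
marker = (old[['alpha', 'beta', 'zeta', 'id']]
          .drop_duplicates()
          .rename(columns={'id': 'numb'}))
checked = new.merge(marker, on=['alpha', 'beta', 'zeta', 'numb'],
                    how='left', indicator=True)
truly_new = checked.loc[checked['_merge'] == 'left_only', new.columns]

# keep everything from old and append only the unmatched rows of new;
# columns that exist only in old (e.g. rand) become NaN for the appended rows
result = pd.concat([old, truly_new], ignore_index=True)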
I have a graphlab SFrame where a few rows have the same id value in the "uid" column.
+-------------------+----------------------+----------------------+--------------+
| VIM Document Type | Vendor Number & Zone | Value <5000 or >5000 | Today Status |
+-------------------+----------------------+----------------------+--------------+
| PO_VR_GLB | 1613407EMEAi | Less than 5000 | 0 |
| PO_VR_GLB | 249737LATIN AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1216902NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1213709EMEAi | Less than 5000 | 0 |
| PO_MN_GLB | 882843NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
+-------------------+----------------------+----------------------+--------------+
+---------------------+
| uid |
+---------------------+
| 63068$#069 |
| 5789$#13 |
| 12933036$#IN6532618 |
| 12933022$#IN6590132 |
| 12932349$#IN6636468 |
| 12952077$#203250 |
| 13012770$#MUML04184 |
| 12945049$#112370 |
| 13582330$#CI160118 |
| 13012770$#MUML04184 |
+---------------------+
Here, I want to retain all the rows with unique uids, and for rows that share the same uid, keep only one of them; the retained row can be any row that has Today Status = 1 (i.e. there can be rows where the uid and status are the same but the other fields differ; in that case we can keep any one of those rows). I want to do these operations on graphlab SFrames but cannot figure out how to proceed.
You may use SFrame.unique(), which gives you the unique rows:
sf = sf.unique()
Another way is to use the groupby() or join() methods, where you can specify the column names and work from there. You can read their documentation on turi.com for the various options.
Another way (that I personally prefer) is to convert the SFrame to a pandas DataFrame, do the data operations there, and then convert the pandas DataFrame back to an SFrame. It depends on your choice, and I hope this helps.
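As a rough sketch of that last route (assuming sf is the SFrame with the columns shown, including 'uid' and 'Today Status', and that your graphlab version provides SFrame.to_dataframe()):

df = sf.to_dataframe()  # SFrame -> pandas DataFrame

# rows with Today Status = 1 sort first, so keeping the first row per uid
# retains a status-1 row whenever one exists
df = (df.sort_values('Today Status', ascending=False)
        .drop_duplicates(subset='uid', keep='first'))

# convert back if the rest of the pipeline expects an SFrame
# import graphlab as gl
# sf = gl.SFrame(df)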