Clean and split DataFrame rows into separate rows - python

I have a DataFrame of user information with Name as the index, plus Mail, Birthday, Genre, User, Age... but the DataFrame has lists in some of the values, for example:
Name | Mail                        | Birthday                       | Genre  | User         | Subscription | Age          | Comments
A    | A#gmail.com                 | [1-1-1990, 1-1-2000, 1-1-1998] | [M, F] | [Z, A]       | Y            | [33, 23, 25] | -
B    | B#gmail.com                 | [1-1-1970, 13-2-1998]          | F      | B            | [Y, N]       | [53, 24]     | -
C    | [C#gmail.com, C2#gmail.com] | [1-1-1985, 1-1-1975]           | [M, F] | C            | [Y, N]       | [38, 53]     | -
D    | D#gmail.com                 | 1-1-1980                       | M      | [D, Q, R, T] | Y            | 43           | -
E    | [E#gmail.com, G#gmail.com]  | 2-6-1975                       | F      | [E, G]       | [Y, N]       | 48           | -
And I want to split those rows into separate rows, getting something like this depending on the values and the cases:
Name | Mail         | Birthday  | Genre | User | Subscription | Age | Comments
A    | A#gmail.com  | 1-1-1990  | M     | Z    | Y            | 33  | -
A2   | A#gmail.com  | 1-1-1998  | F     | A    | Y            | 25  | -
B    | B#gmail.com  | 13-2-1998 | F     | B    | Y            | 24  | -
C    | C#gmail.com  | 1-1-1985  | M     | C    | Y            | 38  | -
C2   | C2#gmail.com | 1-1-1975  | F     | C2   | N            | 53  | -
D    | D#gmail.com  | 1-1-1980  | M     | D    | Y            | 43  | -
E    | E#gmail.com  | 2-6-1975  | F     | E    | Y            | 48  | -
E2   | G#gmail.com  | 2-6-1975  | F     | G    | N            | 48  | -
Different possible cases:
Remove dates like 1-1-1970 and 1-1-2000, and the ages that go with those dates.
If only the User column holds a list and the rest of the columns do not, drop the extra users and use the mail (without the #) as the user.
If there is only one user but two or more values in the other columns, take the mails (without the #) as the users.
Split the rows that contain lists; where a column does not hold a list, keep the same value in both rows.
I don't know if it's possible; I just took the data from a badly organized database.
I got the first DataFrame using the function from the answer to How can i combine and pull apart rows of a DataFrame?.
I tried to split these rows with this function:
def split_list_cols(row):
    split_rows = []
    for col, val in row.items():
        if isinstance(val, list):
            for item in val:
                new_row = row.copy()
                new_row[col] = item
                split_rows.append(new_row)
            row[col] = None
    return split_rows
but it did not work well. From this DataFrame:
Name | Mail                       | Birthday | Genre | User   | Subscription | Age | Comments
E    | [E#gmail.com, G#gmail.com] | 2-6-1975 | F     | [E, G] | [Y, N]       | 48  | -
it gives me:
Name | Mail        | Birthday | Genre | User   | Subscription | Age | Comments
E    | E#gmail.com | 2-6-1975 | F     | [E, G] | [Y, N]       | 48  | -
E    | G#gmail.com | 2-6-1975 | F     | [E, G] | [Y, N]       | 48  | -
E    | None        | 2-6-1975 | F     | E      | [Y, N]       | 48  |
E    | None        | 2-6-1975 | F     | G      | [Y, N]       | 48  |
E    | None        | 2-6-1975 | F     | None   | Y            | 48  |
E    | None        | 2-6-1975 | F     | None   | N            | 48  |
and it should give:
Name | Mail        | Birthday | Genre | User | Subscription | Age | Comments
E    | E#gmail.com | 2-6-1975 | F     | E    | Y            | 48  | -
E2   | G#gmail.com | 2-6-1975 | F     | G    | N            | 48  | -

Different possible cases:
It's hard to understand all of the cases, but from what I understood you could do this:
df = df.explode(["Birthday", "Genre", "User", "Subscription"])
df = df[(df["Birthday"].ne("1-1-1970")) & (df["Birthday"].ne("1-1-2000"))]
df = df.drop_duplicates(ignore_index=True)
df["Name"] = df["Name"] + df.groupby("Name").cumcount().add(1).astype(str)
df["User"] = df["User"] + df.groupby("User").cumcount().add(1).astype(str)

Python Pivot Table based on multiple criteria

I asked a question at this link: SUMIFS in python jupyter.
However, I just realized that the solution didn't work because they can switch in and switch out on different dates. So basically they have to switch out first before they can switch in.
Here is the dataframe (sorted based on the date):
+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date | Amount |
+---------------+--------+---------+-----------+--------+
| Out | 1 | B | 15-Aug-19 | 360 |
| In | 1 | A | 16-Aug-19 | 180 |
| In | 1 | B | 17-Aug-19 | 180 |
| Out | 1 | A | 18-Aug-19 | 140 |
| In | 1 | B | 18-Aug-19 | 80 |
| In | 1 | A | 19-Aug-19 | 60 |
| Out | 2 | B | 14-Aug-19 | 45 |
| Out | 2 | C | 15-Aug-20 | 85 |
| In | 2 | C | 15-Aug-20 | 130 |
| Out | 2 | A | 20-Aug-19 | 100 |
| In | 2 | A | 22-Aug-19 | 30 |
| In | 2 | B | 23-Aug-19 | 30 |
| In | 2 | C | 23-Aug-19 | 40 |
+---------------+--------+---------+-----------+--------+
I would then create a new column that divides the rows into different transactions.
+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out | 1 | B | 15-Aug-19 | 360 | 1 |
| In | 1 | A | 16-Aug-19 | 180 | 1 |
| In | 1 | B | 17-Aug-19 | 180 | 1 |
| Out | 1 | A | 18-Aug-19 | 140 | 2 |
| In | 1 | B | 18-Aug-19 | 80 | 2 |
| In | 1 | A | 19-Aug-19 | 60 | 2 |
| Out | 2 | B | 14-Aug-19 | 45 | 3 |
| Out | 2 | C | 15-Aug-20 | 85 | 3 |
| In | 2 | C | 15-Aug-20 | 130 | 3 |
| Out | 2 | A | 20-Aug-19 | 100 | 4 |
| In | 2 | A | 22-Aug-19 | 30 | 4 |
| In | 2 | B | 23-Aug-19 | 30 | 4 |
| In | 2 | C | 23-Aug-19 | 40 | 4 |
+---------------+--------+---------+-----------+--------+------+
With this, I can apply the pivot formula and take it from there.
How do I do this in Python? In Excel, I can just use multiple SUMIFS and compare in and out, but that approach isn't available in Python.
Thank you!
One simple solution is to iterate and apply a check (a function) to each element, the results becoming the new column: map.
Using df.index.map we get each item's index to pass as an argument, so we can fetch and compare values. In your case the aim is to identify each change to "Out", keeping a counter.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    global counter
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i-1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df
Result:
+----+-----------------+--------+
| | Switch In/Out | Rows |
|----+-----------------+--------|
| 0 | Out | 1 |
| 1 | In | 1 |
| 2 | In | 1 |
| 3 | Out | 2 |
| 4 | In | 2 |
| 5 | In | 2 |
| 6 | Out | 3 |
| 7 | Out | 3 |
| 8 | In | 3 |
| 9 | Out | 4 |
| 10 | In | 4 |
| 11 | In | 4 |
| 12 | In | 4 |
+----+-----------------+--------+

Using pandas as a distance matrix and then getting a sub-dataframe of relevant distances

I have created a pandas df that has the distances between location i and location j. Beginning with a start point P1 and end point P2, I want to find the sub-dataframe (distance matrix) with P1 and P2 on one axis and the rest of the indices on the other.
I'm using a pandas DF because I think it's the most efficient way.
dm_dict = ...  # distance matrix in dict form: dm_dict[i][j] gives the distance from i to j
dm_df = pd.DataFrame().from_dict(dm_dict)
P1 = dm_df.max(axis=0).idxmax()
P2 = dm_df[P1].idxmax()
route = [P1, P2]
remaining_locs = dm_df[dm_df[~dm_df.isin(route)].isin(route)]
while not_done:
    # go through the remaining_locs until all the locations are added
No error messages, but the remaining_locs df is full of NaNs rather than the distances.
Using dm_df[~dm_df.isin(route)].isin(route) seems to give me a boolean df that is accurate.
Sample data (it's technically the haversine distance, but Euclidean should be fine for filling up the matrix):
import numpy

def dist(i, j):
    a = numpy.array((i[1], i[2]))
    b = numpy.array((j[1], j[2]))
    return numpy.linalg.norm(a - b)

locations = [
    ("Ottawa", 45.424722, -75.695),
    ("Edmonton", 53.533333, -113.5),
    ("Victoria", 48.428611, -123.365556),
    ("Winnipeg", 49.899444, -97.139167),
    ("Fredericton", 49.899444, -97.139167),
    ("StJohns", 47.561389, -52.7125),
    ("Halifax", 44.647778, -63.571389),
    ("Toronto", 43.741667, -79.373333),
    ("Charlottetown", 46.238889, -63.129167),
    ("QuebecCity", 46.816667, -71.216667),
    ("Regina", 50.454722, -104.606667),
    ("Yellowknife", 62.442222, -114.3975),
    ("Iqaluit", 63.748611, -68.519722),
]
dm_dict = {i: {j: dist(i, j) for j in locations if j != i} for i in locations}
It looks like you want scipy's distance_matrix:
import pandas as pd
from scipy.spatial import distance_matrix

df = pd.DataFrame(locations)
x = df[[1, 2]]
dm = pd.DataFrame(distance_matrix(x, x),
                  index=df[0],
                  columns=df[0])
Output:
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| | Ottawa | Edmonton | Victoria | Winnipeg | Fredericton | StJohns | Halifax | Toronto | Charlottetown | QuebecCity | Regina | Yellowknife | Iqaluit |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| Ottawa | 0.000000 | 38.664811 | 47.765105 | 21.906059 | 21.906059 | 23.081609 | 12.148481 | 4.045097 | 12.592181 | 4.689667 | 29.345960 | 42.278586 | 19.678657 |
| Edmonton | 38.664811 | 0.000000 | 11.107987 | 16.759535 | 16.759535 | 61.080146 | 50.713108 | 35.503607 | 50.896264 | 42.813477 | 9.411122 | 8.953983 | 46.125669 |
| Victoria | 47.765105 | 11.107987 | 0.000000 | 26.267600 | 26.267600 | 70.658378 | 59.913580 | 44.241193 | 60.276176 | 52.173796 | 18.867990 | 16.637528 | 56.945306 |
| Winnipeg | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| Fredericton | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| StJohns | 23.081609 | 61.080146 | 70.658378 | 44.488147 | 44.488147 | 0.000000 | 11.242980 | 26.933071 | 10.500284 | 18.519147 | 51.974763 | 63.454538 | 22.625084 |
| Halifax | 12.148481 | 50.713108 | 59.913580 | 33.976105 | 33.976105 | 11.242980 | 0.000000 | 15.827902 | 1.651422 | 7.946971 | 41.444115 | 53.851052 | 19.731392 |
| Toronto | 4.045097 | 35.503607 | 44.241193 | 18.802741 | 18.802741 | 26.933071 | 15.827902 | 0.000000 | 16.434995 | 8.717042 | 26.111037 | 39.703942 | 22.761342 |
| Charlottetown | 12.592181 | 50.896264 | 60.276176 | 34.206429 | 34.206429 | 10.500284 | 1.651422 | 16.434995 | 0.000000 | 8.108112 | 41.691201 | 53.767927 | 18.320711 |
| QuebecCity | 4.689667 | 42.813477 | 52.173796 | 26.105163 | 26.105163 | 18.519147 | 7.946971 | 8.717042 | 8.108112 | 0.000000 | 33.587610 | 45.921044 | 17.145385 |
| Regina | 29.345960 | 9.411122 | 18.867990 | 7.488117 | 7.488117 | 51.974763 | 41.444115 | 26.111037 | 41.691201 | 33.587610 | 0.000000 | 15.477744 | 38.457705 |
| Yellowknife | 42.278586 | 8.953983 | 16.637528 | 21.334745 | 21.334745 | 63.454538 | 53.851052 | 39.703942 | 53.767927 | 45.921044 | 15.477744 | 0.000000 | 45.896374 |
| Iqaluit | 19.678657 | 46.125669 | 56.945306 | 31.794214 | 31.794214 | 22.625084 | 19.731392 | 22.761342 | 18.320711 | 17.145385 | 38.457705 | 45.896374 | 0.000000 |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
I am pretty sure this is what I wanted:
filtered = dm_df.filter(items=route,axis=1).filter(items=set(locations).difference(set(route)), axis=0)
filtered is a df with [2 rows x 10 columns], and then I can find the minimum value from there.
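Equivalently, with the dm built above you can slice with drop plus column selection instead of two filter calls; a small sketch (the route labels here are hypothetical):

route = ["Victoria", "StJohns"]     # hypothetical endpoints P1, P2
sub = dm.drop(index=route)[route]   # rows: every other city; columns: the route
closest = sub.min(axis=1).idxmin()  # city nearest to either endpoint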

Generating multiple csv files from a list in pandas, python

I'm trying to create a new dataframe for each possible combination in 'combinations', reading in some values from a dataframe. An example of the dataframe:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
Here is my code at the moment.
import itertools
import pandas as pd

letters = ['G','A','L','M','F','W','K','Q','E','S','P','V','I','C','Y','H','R','N','D','T']
combinations = [''.join(i) for j in range(1, len(letters) + 1)
                for i in itertools.combinations(letters, r=j)]

df = pd.read_csv('COMPLETECOPYFORR.csv')

for combination in combinations:
    new_df = df[['Species', 'OGT']]
    new_df['Sum of percentage'] = df[list(combination)]
    new_df.to_csv(combination + '.csv')
The desired output is on the order of a million CSV files (2^20 - 1 = 1,048,575 combinations of 20 letters), each named after a different combination, so
G.csv, A.csv, through to GALMFWKQESPVICYHRNDT.csv
Species OGT Sum of percentage
------------------------------- ----- -------------------
Aeropyrum pernix 95 23.4353
Anaeromyxobacter dehalogenans 27 20.3232
Argobacterium fabrum 26 14.2312
Aquifex aeolicus 85 15.0403
Archaeoglobus fulgidus 83 34.0532
It looks like you need:
new_df['Sum of percentage'] = df[list(combination)].sum(axis=1)
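That selects the combination's columns and sums across each row. A tiny check with values from the first row above (adding .copy() also avoids the SettingWithCopyWarning the loop would otherwise raise):

import pandas as pd

df = pd.DataFrame({"Species": ["Aeropyrum pernix"], "OGT": [95],
                   "G": [8.8666657183], "A": [9.7659115711]})
combination = "GA"
new_df = df[["Species", "OGT"]].copy()
new_df["Sum of percentage"] = df[list(combination)].sum(axis=1)  # 18.6325772894
new_df.to_csv(combination + ".csv", index=False)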

Pivot or Transpose a table in Python/Pandas

I have the following data
+------+-------+---------+---------+---------+
| Name | Group | Limit1W | Limit1M | Limit3M |
+------+-------+---------+---------+---------+
| Bob | A | 100 | 50 | 0 |
| Bob | B | 100 | 50 | 0 |
| Alex | A | 20 | 50 | 0 |
| Alex | B | 0 | 0 | 0 |
+------+-------+---------+---------+---------+
I want to get
+------+-------+---------+-------+
| Name | Group | Game | Limit |
+------+-------+---------+-------+
| Bob | A | Limit1W | 100 |
| Bob | A | Limit1M | 50 |
| Bob | A | Limit3M | 0 |
| Bob | B | Limit1W | 100 |
| Bob | B | Limit1M | 50 |
| Bob | B | Limit3M | 0 |
| Alex | A | Limit1W | 20 |
| Alex | A | Limit1M | 50 |
| Alex | A | Limit3M | 0 |
| Alex | B | Limit1W | 0 |
| Alex | B | Limit1M | 0 |
| Alex | B | Limit3M | 0 |
+------+-------+---------+-------+
I have tried using pivot_table but I keep getting
TypeError: can only concatenate list (not "tuple") to list
even though the limits only contain numbers.
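No answer was recorded for this one, but the reshape being described is wide-to-long, which melt handles directly (pivot_table goes the opposite way); a sketch with the sample data:

import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "Bob", "Alex", "Alex"],
                   "Group": ["A", "B", "A", "B"],
                   "Limit1W": [100, 100, 20, 0],
                   "Limit1M": [50, 50, 50, 0],
                   "Limit3M": [0, 0, 0, 0]})

# id_vars stay fixed; each Limit* column becomes a (Game, Limit) pair.
out = df.melt(id_vars=["Name", "Group"], var_name="Game", value_name="Limit")
# melt groups by Game first; a stable sort restores the desired row order.
out = out.sort_values(["Name", "Group"], ascending=[False, True],
                      kind="mergesort", ignore_index=True)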

Interpolate in SQL based on subgroup in django models

I have the following sheetinfo model with the following data:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 0 |
| SAT123A03 | SAT123 | A | 3 | 0 |
| SAT123A04 | SAT123 | A | 4 | 0 |
| SAT123A05 | SAT123 | A | 5 | 500 |
| SAT123B05 | SAT123 | B | 5 | 400 |
| SAT123B04 | SAT123 | B | 4 | 0 |
| SAT123B03 | SAT123 | B | 3 | 0 |
| SAT123B02 | SAT123 | B | 2 | 500 |
| SAT124A01 | SAT124 | A | 1 | 400 |
| SAT124A02 | SAT124 | A | 2 | 0 |
| SAT124A03 | SAT124 | A | 3 | 0 |
| SAT124A04 | SAT124 | A | 4 | 475 |
I would like to interpolate and update the table with the correct T_val.
The formula (written so the first sheet keeps the subgroup minimum, as the expected output below shows) is:
new_t_val = delta / (cnt - 1) * (sheet_num - 1) + min_tvc_of_subgroup
For SAT123 subgroup A: delta = 500 - 400 = 100 and cnt = 5, so sheet 2 gets 100 / 4 * 1 + 400 = 425. For instance, the top 5 rows become:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 425 |
| SAT123A03 | SAT123 | A | 3 | 450 |
| SAT123A04 | SAT123 | A | 4 | 475 |
| SAT123A05 | SAT123 | A | 5 | 500 |
I have a Django query that updates the data, however it is SLOW and stops after a while (due to type errors etc.).
My question: is there a way to accomplish this in SQL?
The ability to do this as one database call doesn't exist in stock Django. 3rd party packages exist though: https://github.com/aykut/django-bulk-update
Example of how that package works:
rows = Model.objects.all()
for row in rows:
    # Modify rows as appropriate
    row.T_val = delta / (cnt - 1) * (row.sheet_num - 1) + min_tvc_of_subgroup
Model.objects.bulk_update(rows)
For datasets up to the 1,000,000 range, this should have reasonable performance. Most of the bottleneck in iterating through and .save()-ing each object is the overhead of a database call per row; the Python part is reasonably fast. The above example makes only two database calls, so it should be perhaps an order of magnitude or two faster.
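(That answer predates it, but stock Django 2.2+ now ships QuerySet.bulk_update natively, with the fields to update named explicitly. A sketch under the same assumptions; Sheetinfo is a hypothetical model name, and delta, cnt, and min_tvc_of_subgroup stand in for per-subgroup values computed as in the question:)

rows = list(Sheetinfo.objects.all())
for row in rows:
    # delta, cnt and min_tvc_of_subgroup are the per-subgroup values from the question
    row.T_val = delta / (cnt - 1) * (row.sheet_num - 1) + min_tvc_of_subgroup
Sheetinfo.objects.bulk_update(rows, ["T_val"])  # one batched UPDATE, fields named explicitly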
