Adding columns to a data frame with data coming from multiple dataframes - python

I want this dataframe as the outcome:
int = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
"sales_year_data":['15','205','24','920','1310'],
"country_image_data":['2','5','-6','7','-1'],
}
df_inv = pd.DataFrame(int)
The data in column "sales_year_data" of df_inv comes from df1:
sales_year_data = {
"country":['Spain','France','Germany','Belgium','Italy'],
"2014":['45','202','24','216','219'],
"2015":['15','55','214','2016','209'],
"2016":['615','2333','205','207','920'],
"2017":['1215','255','234','2116','101'],
"2018":['415','1320','214','2516','2019'],
"2019":['215','220','5614','416','1310'],
"2020":['205','202','44','296','2011'],
}
df1 = pd.DataFrame(sales_year_data)
As you can see in the column "sales_year_data" of df_inv, the number 15 is the intersection in df1 between year 2015 and Spain, the number 205 is in the intersection between Spain and 2020, 24 is in the intersection between Germany and 2014 and so on.
The data in column "country_image_data" of df_inv comes from df2:
country_change_data = {
"country":['Spain','Spain','Germany','Italy','Italy'],
"2014":['4','2','-6','6','9'],
"2015":['2','5','-5','2','3'],
"2016":['5','3','5','7','9'],
"2017":['8','7','5','6','1'],
"2018":['5','1','4','6','2'],
"2019":['1','2','4','6','-1'],
"2020":['5','2','4','6','2'],
}
df2 = pd.DataFrame(country_change_data)
As you can see in the column "country_image_data" of df_inv, the number 2 is the intersection in df2 between year 2015 and Spain, the number 5 is the intersection between Spain and 2020, -6 is the intersection between Germany and 2014, and so on.
If my original dataframe is:
inv = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
}
df0 = pd.DataFrame(inv)
How could I automate the lookups across df1 and df2 at the intersections of interest to build df_inv starting from df0?

This does it.
sales_counters = {}
country_counters = {}
new_df_data = []
for row in df0.iloc:
    c = row['country']
    y = row['yeardum']
    # per-country counters pick the next matching row when a country
    # appears more than once in a lookup table (e.g. Spain in df2)
    sales_idx = sales_counters[c] = sales_counters.get(c, -1) + 1
    country_idx = country_counters[c] = country_counters.get(c, -1) + 1
    d1 = df1[df1['country'] == c]
    d2 = df2[df2['country'] == c]
    sales_year = d1.iloc[min(sales_idx, d1.shape[0] - 1)][y]
    country_image = d2.iloc[min(country_idx, d2.shape[0] - 1)][y]
    new_df_data.append([sales_year, country_image])
df0 = pd.concat([df0, pd.DataFrame(new_df_data)], axis=1).rename(
    {0: 'sales_year_data', 1: 'country_image_data'}, axis=1)
Test:
>>> df0
vendor country yeardum sales_year_data country_image_data
0 A Spain 2015 15 2
1 B Spain 2020 205 2
2 C Germany 2014 24 -6
3 D Italy 2016 920 7
4 E Italy 2019 1310 -1
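For lookup tables where each country appears only once (as in df1), a melt-plus-merge sketch could replace the loop for that column; df2's repeated Spain/Italy rows still need the counter logic above. Starting again from the original df0:
sales_long = df1.melt(id_vars='country', var_name='yeardum',
                      value_name='sales_year_data')
df_sales = df0.merge(sales_long, on=['country', 'yeardum'], how='left')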

Related

How to merge two unequal rows of a pandas dataframe, where one column value is to match and another column is to be added?

I have the following pandas dataframe:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have two different IDs, so that ultimately only one remains. The merge should only take place if the values in the column "YEAR" match. The values in the column "VALUE" should be added.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Line 1 and line 5 have been merged. Line 5 is removed and line 1 remains with the previous ID, but the VALUEs of line 1 and line 5 have been added.
I would like to specify later which two lines or which two IDs should be merged. One of the two should always remain. The two IDs to be merged come from another function.
I experimented with the groupby() function, but I don't know how to merge two different IDs there. I managed it only with identical values of the "ID" column. This then looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this, just group on 'ID' and take the max YEAR and sum VALUE:
df.groupby('ID', as_index=False).agg({'YEAR':'max', 'VALUE':'sum'})
Output:
ID YEAR VALUE
0 1234 2013 27
1 4321 2013 6
Or group on year and take first ID:
df.groupby('YEAR', as_index=False).agg({'ID':'first', 'VALUE':'sum'})
Output:
YEAR ID VALUE
0 2012 1234 3
1 2013 1234 30
Based on all the comments and the update to the question, it sounds like the following logic (though maybe not this exact code) is required...
Try:
import pandas as pd

d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
df['ID'] = df['ID'].astype(int)

def correctRows(l, i):
    # return the index from l whose YEAR matches the YEAR of row i
    for x in l:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            row = x
            break
    return row

def mergeRows(a, b):
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    if len(rowa) > 1:
        if type(rowb) == list:
            rowa = correctRows(rowa, rowb[0])
        else:
            rowa = correctRows(rowa, rowb)
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if type(rowa) == list:
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping:', df.loc[rowa].to_string().replace('\n', ', '))
    print('Dropping:', df.loc[rowb].to_string().replace('\n', ', '))
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop=True, inplace=True)
    return None

# Add two IDs: the first 'ID' is kept, the second is dropped, but its 'VALUE'
# is added to the 'VALUE' of the first.
# Note: df['ID'] was cast with astype(int) above, hence integers are required.
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
Frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
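A shorter alternative sketch (not the code above; it assumes the keep/drop ID pair is known up front and starts from the original frame): map the ID to be dropped onto the ID to keep, then sum VALUE per (ID, YEAR). Note that groupby changes the row order.
keep_id, drop_id = 1234, 4321   # hypothetical pair, matching the example call above
merged = (df.assign(ID=df['ID'].replace({drop_id: keep_id}))
            .groupby(['ID', 'YEAR'], as_index=False)['VALUE'].sum())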

Pandas aggregations in python

I have the following data set. I want to create a dataframe that contains all teams and includes the number of games played, wins, losses, draws, and the average point differential in 2017 (Y = 17).
Date Y HomeTeam AwayTeam HomePoints AwayPoints
2014-08-16 14 Arsenal Crystal Palace 2 1
2014-08-16 14 Leicester Everton 2 2
2014-08-16 14 Man United Swansea 1 2
2014-08-16 14 QPR Hull 0 1
2014-08-16 14 Stoke Aston Villa 0 1
I wrote the following code:
df17 = df[df['Y'] == 17]
df17['differential'] = abs(df['HomePoints'] - df['AwayPoints'])
df17['home_wins'] = np.where(df17['HomePoints'] > df17['AwayPoints'], 1, 0)
df17['home_losses'] = np.where(df17['HomePoints'] < df17['AwayPoints'], 1, 0)
df17['home_ties'] = np.where(df17['HomePoints'] == df17['AwayPoints'], 1, 0)
df17['game_count'] = 1
df17.groupby("HomeTeam").agg({"differential": np.mean, "home_wins": np.sum, "home_losses": np.sum, "home_ties": np.sum, "game_count": np.sum}).sort_values(["differential"], ascending = False)
But I don't think this is correct, as I'm only accounting for the home team. Does someone have a cleaner method?
Melting the dataframe gives us two rows per original row: one for the HomeTeam and one for the AwayTeam.
Please find the documentation for the melt method here: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
df = pd.melt(df, id_vars=['Date', 'Y', 'HomePoints', 'AwayPoints'],
             value_vars=['HomeTeam', 'AwayTeam'])
df = df.rename({'value': 'Team', 'variable': 'Home/Away'}, axis=1)
df['Differential'] = df['Home/Away'].replace({'HomeTeam': 1, 'AwayTeam': -1}) * (df['HomePoints'] - df['AwayPoints'])

def count_wins(x):
    return (x > 0).sum()

def count_losses(x):
    return (x < 0).sum()

def count_draws(x):
    return (x == 0).sum()

df = df.groupby('Team')['Differential'].agg(['count', count_wins, count_losses, count_draws, 'sum'])
df = df.rename({'count': 'Number of games', 'count_wins': 'Wins', 'count_losses': 'Losses', 'count_draws': 'Draws', 'sum': 'Differential'}, axis=1)
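To restrict this to 2017 as the question asks, a minimal sketch is to filter before melting and run the same pipeline on the filtered frame:
df17 = df[df['Y'] == 17].copy()
# then apply the melt / groupby steps above to df17 instead of df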

Flatten a column in pandas Dataframe

I have a json like the following:
js = """[{"id": 13, "kits": [{"kit": "KIT1216A", "quantity_parts": 80, "quantity_kit": 1},
{"kit": "KIT1216B", "quantity_parts":680, "quantity_kit": 11}],
"transaction_date": "2020-11-27T05:02:03.822000Z", "dispatch_date": "2020-11-27T05:02:05.919000Z", "transaction_no"
: 2005, "transporter_name": "TCI", "vehicle_details": "hr55ab3337", "invoice_number": "355733019", "remarks": "0", "sending_location": 11, "owner": 4}]"""
Where kits is a list containing multiple dictionaries.
How do I flatten the dataframe I created from it so that the data in kits is included in the row itself?
I simply tried:
import json
import pandas as pd

data = json.loads(js)
df = pd.DataFrame(data)
Output:
id kits transaction_date dispatch_date transaction_no transporter_name vehicle_details invoice_number remarks sending_location owner
0 13 [{'kit': 'KIT1216A', 'quantity_parts': 80, 'qu... 2020-11-27T05:02:03.822000Z 2020-11-27T05:02:05.919000Z 2005 TCI hr55ab3337 355733019 0 11 4
Desired Output:
Use json_normalize:
data = json.loads(js)
cols = ['id','transaction_date','dispatch_date','transaction_no','transporter_name',
'vehicle_details','invoice_number','remarks','sending_location','owner']
df = pd.json_normalize(data, 'kits', cols)
print (df)
kit quantity_parts quantity_kit id transaction_date \
0 KIT1216A 80 1 13 2020-11-27T05:02:03.822000Z
1 KIT1216B 680 11 13 2020-11-27T05:02:03.822000Z
dispatch_date transaction_no transporter_name \
0 2020-11-27T05:02:05.919000Z 2005 TCI
1 2020-11-27T05:02:05.919000Z 2005 TCI
vehicle_details invoice_number remarks sending_location owner
0 hr55ab3337 355733019 0 11 4
1 hr55ab3337 355733019 0 11 4
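An alternative sketch (assuming the same parsed data list as above) that keeps the original frame and expands it: explode the kits column into one row per kit, normalize the dicts, and join the result back.
df = pd.DataFrame(data).explode('kits').reset_index(drop=True)
kit_cols = pd.json_normalize(df['kits'].tolist())
df = pd.concat([df.drop(columns='kits'), kit_cols], axis=1)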

Getting a value out of pandas dataframe based on a set of conditions

I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to get a row based on a certain condition, i.e. associate the label Number with the closest Capital, i.e. Berlin:
3 four million Number 21 24 -> 1 Berlin Capital 6 9
or something like:
df[row3] -> df[row1]
A pseudo logic:
First, check for the rows with label Number; the assumption is that the capital is always two rows above or below and has the label Capital. Also, the label Capital is always located after the label Country.
What I have done so far:
columnsName =['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col= 0, skip_blank_lines=True, header=0)
print(df)
key_number = 'Number'
df_with_number = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)])
print(df_with_number)
key_capital = 'Capital'
df_with_capitals = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)])
print(df_with_capitals)
key_country = 'Country'
df_with_country = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)])
print(df_with_country)
The logic is to compare the indexes and then make the possible relations, i.e.
df[row3] -> [df[row1], df[row7]]
You could use merge_asof with the parameter direction='nearest', for example:
df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
                          df_with_capitals.reset_index(),
                          on='index',
                          suffixes=('_nb', '_cap'), direction='nearest')
print (df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
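One caveat worth noting (a general merge_asof requirement, not specific to this data): both frames must be sorted on the key column. The reset_index() calls above already yield ascending 'index' columns, but on shuffled data an explicit sort would be needed first, e.g.:
left = df_with_number.reset_index().sort_values('index')
right = df_with_capitals.reset_index().sort_values('index')
df_nb_cap = pd.merge_asof(left, right, on='index',
                          suffixes=('_nb', '_cap'), direction='nearest')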
from io import StringIO
import numpy as np
import pandas as pd

# adjusted sample data
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))
# create a mask for number where capital is 2 above or below
# and where country is three above number or one below number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') |
                                     (df['Label'].shift(-2) == 'Capital')) &
                                    (df['Label'].shift(3) == 'Country') |
                                    (df['Label'].shift(-1) == 'Country'))
# create a mask for capital where number is 2 above or below
# and where country is one above capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') |
                                       (df['Label'].shift(-2) == 'Number')) &
                                      (df['Label'].shift(1) == 'Country'))
# hstack your two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9
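A more general sketch (an illustration, not taken from either answer) that pairs every Number row with the Capital row whose index is closest, without assuming a fixed two-row offset:
numbers = df[df['Label'] == 'Number']
capitals = df[df['Label'] == 'Capital']
pairs = []
for idx, num_row in numbers.iterrows():
    # index label of the Capital row closest to this Number row
    nearest = (capitals.index.to_series() - idx).abs().idxmin()
    pairs.append(pd.concat([num_row, capitals.loc[nearest].add_suffix('_cap')]))
result = pd.DataFrame(pairs).reset_index(drop=True)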

How to create a dataframe with simulated data in python

I have a sample schema consisting of 12 columns, and each column has certain categories. Now I need to simulate that data into a dataframe of around 1000 rows. How do I go about it?
I have used the code below to generate data for each column:
Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
Location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
Age = random.choice(Age)
Gender = ['Female','Male','Other']
Gender = random.choice(Gender)
and so on
I need the output as below
Location Age Gender
Dubai below 18 Female
India 65 and older Male
.
.
.
.
You can create each column one by one using np.random.choice:
import numpy as np
import pandas as pd

df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)
Or do that using a list comprehension:
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
[np.random.choice(column_to_choice[c], 100) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
Location Age Gender
0 India 65 and older Female
1 Berlin Between 18 and 64 Female
2 USA Between 18 and 64 Male
3 Indonesia Under 18 Male
4 Dubai Under 18 Other
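A small variant (a sketch; the seed value is arbitrary, and it assumes Location, Age and Gender still hold the full category lists) builds the frame in one constructor with a fixed seed so the simulated data is reproducible:
rng = np.random.default_rng(42)   # fixed seed for reproducibility
N = 1000
df = pd.DataFrame({
    "Location": rng.choice(Location, size=N),
    "Age": rng.choice(Age, size=N),
    "Gender": rng.choice(Gender, size=N),
})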
You can create a for loop for the number of rows you want in your dataframe and generate a list of dictionaries. Then use that list of dictionaries to build the dataframe.
In [15]: list2 = []
In [16]: for i in range(5):
    ...:     loc = random.choice(Location)
    ...:     age = random.choice(Age)
    ...:     gen = random.choice(Gender)
    ...:     k = {'Location':loc, 'Age':age, 'Gender':gen}
    ...:     list2.append(k)
    ...:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
Age Gender Location
0 Between 18 and 64 Other Berlin
1 65 and older Other USA
2 65 and older Male Dubai
3 Between 18 and 64 Male Dubai
4 Between 18 and 64 Male Indonesia
