I'm new to Python.
How should I get the result below? If the (cod, date) pair of df_1 exists in df_2, then I should add the row, as explained in my code below.
import pandas as pd

data1 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07'], 'cod': ['12', '12', '14', '15', '15', '18'], 'Zone': ['LA', 'NY', 'LA', 'NY', 'PARIS', 'PARIS'], 'Revenue_Radio': [10, 20, 30, 50, 40, 10]}
df_1 = pd.DataFrame(data1)
data2 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-08'], 'cod': ['12', '14', '15', '15', '18'], 'Zone': ['PARIS', 'NY', 'LA', 'NY', 'NY'], 'Revenue_Str': [10, 20, 30, 50, 5]}
df_2 = pd.DataFrame(data2)
My code is:
dfx = df_2[df_2['cod'].isin(df_1['cod']) &
(df_2['date'].isin(df_1['date'])) ]
df = (df_1.merge(dfx, on=['date','cod','Zone'], how='outer')
.fillna(0)
.sort_values(['date','cod'], ignore_index=True))
Expected output
data_result = {'date': ['2021-06', '2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07', '2021-07','2021-07'], 'cod': ['12', '12', '12', '14', '14', '15', '15', '15', '18'], 'Zone': ['LA', 'NY', 'PARIS','LA', 'NY', 'NY', 'PARIS', 'LA', 'PARIS'], 'Revenue_Radio': [10, 20, 0, 30, 0, 50, 40, 0, 10], 'Revenue_Str': [0, 0, 10,0, 20, 50, 0, 30, 0]}
df_result = pd.DataFrame(data_result)
With my code above, I'm getting something wrong: the row 2021-06 14 NY should not exist in the final df.
IIUC, try:
output = df_1.merge(df_2, on=["date", "cod", "Zone"], how="outer")
output = output[output.set_index(["date", "cod"]).index.isin(df_1.set_index(["date", "cod"]).index)]
output = output.sort_values(['date','cod'], ignore_index=True)
date cod Zone Revenue_Radio Revenue_Str
0 2021-06 12 LA 10.0 0.0
1 2021-06 12 NY 20.0 0.0
2 2021-06 12 PARIS 0.0 10.0
3 2021-07 14 LA 30.0 0.0
4 2021-07 15 NY 50.0 50.0
5 2021-07 15 PARIS 40.0 0.0
6 2021-07 15 LA 0.0 30.0
7 2021-07 18 PARIS 10.0 0.0
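To see why filtering on the (date, cod) pair works where two separate isin calls do not, here is a minimal, self-contained sketch (toy values, not the full data from the question). Checking date and cod independently keeps any row whose date appears somewhere in df_1 and whose cod appears somewhere, even if never together; the MultiIndex isin checks the pair jointly.

```python
import pandas as pd

# Toy frames mirroring the question's structure; values are made up.
df_1 = pd.DataFrame({'date': ['2021-06'], 'cod': ['12'],
                     'Zone': ['LA'], 'Revenue_Radio': [10]})
df_2 = pd.DataFrame({'date': ['2021-06', '2021-06'], 'cod': ['12', '14'],
                     'Zone': ['PARIS', 'NY'], 'Revenue_Str': [10, 20]})

# Outer-merge, then keep only rows whose (date, cod) pair occurs in df_1.
out = df_1.merge(df_2, on=['date', 'cod', 'Zone'], how='outer')
mask = out.set_index(['date', 'cod']).index.isin(
    df_1.set_index(['date', 'cod']).index)
out = out[mask].fillna(0).sort_values(['date', 'cod'], ignore_index=True)
print(out)  # the (2021-06, 14) row from df_2 is filtered out
```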
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '3200', '65000'],
'H' : ['2', '15.5', '150.5', '1500', '54000'],
'W' : ['5', '85.0', '640.0', '1650', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
So here I have two example dataframes. The first shows unique item numbers with given dimensions. df2 shows the maximum allowable dimensions for a given rank and code, meaning no element (length, width, height) may exceed the given maximums. I would like to check the dimensions in df1 against df2 until all dimension criteria are True, in order to retrieve the item's 'Rank' and 'Code'. So, in essence, iterate down df2 row by row until all the criteria are True.
Make a new df3 as follows:
ItemNo Rank Code
001 1 aa
002 2 bb
003 3 cc
004 4 dd
005 5 ee
Using numpy:
- changed the sample data so that the results are not just incrementing
- get the index of the first row in df2 that satisfies the required logic
- build df3 using the indexes from step 2
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '5', '65000'],
'H' : ['2', '15.5', '150.5', '5', '54000'],
'W' : ['5', '85.0', '640.0', '5', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
# fix up datatypes for comparisons
df1.loc[:,["L","H","W"]] = df1.loc[:,["L","H","W"]].astype(float)
df2.loc[:,["Length","Height","Width"]] = df2.loc[:,["Length","Height","Width"]].astype(float)
# row-by-row comparison ("must not exceed" means <=); argmax picks the first True
idx = [np.argmax((df1.loc[r,["L","H","W"]].values
                  <= df2.loc[:,["Length","Height","Width"]].values).all(axis=1))
       for r in df1.index]
# finally the result
pd.concat([df1.ItemNo, df2.loc[idx,["Rank","Code"]].reset_index(drop=True)],axis=1)
  ItemNo Rank Code
0    001    1   aa
1    002    2   bb
2    003    3   cc
3    004    1   aa
4    005    5   ee
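Step 2 relies on np.argmax over a boolean array returning the position of the first True, which is what selects the smallest rank whose limits fit:

```python
import numpy as np

# Each entry says whether the item fits inside that rank's limits.
fits = np.array([False, False, True, True])

# argmax returns the first index of the maximum value (True), i.e.
# the first rank where all dimensions fit.
first_fit = np.argmax(fits)
print(first_fit)  # 2
```

One caveat worth knowing: if no entry is True, argmax returns 0, which would silently assign the first rank; a guard with `fits.any()` avoids that.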
I think you can try:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005', '006'],
'L' : ['5', '65.0', '445.0', '3200', '65000', '10'],
'H' : ['2', '15.5', '150.5', '1500', '54000','1000'],
'W' : ['5', '85.0', '640.0', '1650', '45000', '10']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
df_sort = pd.DataFrame({'W': np.searchsorted(df2['Width'].astype(float), df1['W'].astype(float)),
'H': np.searchsorted(df2['Height'].astype(float), df1['H'].astype(float)),
'L': np.searchsorted(df2['Length'].astype(float), df1['L'].astype(float))})
df1['Rank'] = df_sort.max(axis=1).map(df2['Rank'])
df1['Code'] = df1['Rank'].map(df2.set_index('Rank')['Code'])
print(df1)
Output:
ItemNo L H W Rank Code
0 001 5 2 5 1 aa
1 002 65.0 15.5 85.0 2 bb
2 003 445.0 150.5 640.0 3 cc
3 004 3200 1500 1650 4 dd
4 005 65000 54000 45000 5 ee
5 006 10 1000 10 3 cc
The core of the code is the np.searchsorted function, which is used to find the index of the value of L in Length (for example), per the conditions listed in the documentation. So I use np.searchsorted for each of the three dimensions, then take the largest resulting index using max(axis=1), and assign the Rank and Code based on that largest value using map.
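A minimal illustration of the searchsorted behaviour described above (toy bounds, not the question's data):

```python
import numpy as np

# Sorted upper bounds, playing the role of df2's Length/Width/Height.
bounds = np.array([10.0, 100.0, 1000.0])

# With the default side='left', a value equal to a bound is inserted
# *before* it, so a value exactly at the limit still lands in that bin.
print(np.searchsorted(bounds, 5.0))    # 0 -> first bin
print(np.searchsorted(bounds, 10.0))   # 0 -> still the first bin
print(np.searchsorted(bounds, 150.5))  # 2 -> third bin
```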
I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
The second dataset has a list of teams that are playing today.
What I would like to do is add to dataset2 the mean of the two playing teams' values, so the outcome would be as shown below.
I have looked through Stack Overflow and not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
}
with the desired output being
Desired = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
}
Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so I have that here as well.
Sample:
data1 = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
})
Desired = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
})
Code:
import pandas as pd
cols = [ x for x in data2 if 'Home' in x or 'Away' in x ]
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0
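The per-column averaging step can be seen in isolation in the following stripped-down sketch (the _x/_y suffixes are what merge appends to the clashing stat columns from the two lookups; the values here are taken from the sample data):

```python
import pandas as pd

# After the two merges each stat exists twice: _x for team 1, _y for team 2.
merged = pd.DataFrame({'Home0-0_x': ['80', '66'],   # Everton, Man City
                       'Home0-0_y': ['80', '78']})  # Hull, Leeds

# Cast to int, average across the pair, round, then drop the originals.
merged['Home0-0'] = (merged[['Home0-0_x', 'Home0-0_y']]
                     .astype(int).mean(axis=1).round(0))
merged = merged.drop(['Home0-0_x', 'Home0-0_y'], axis=1)
print(merged)
```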
Thanks for adding the data. Here is a simple way using loops. Loop through df2 (the upcoming matches) and find the rows of the two participating teams in df1 (the team statistics). You will then have 2 rows from df1; average the desired columns and add the result to df2.
Considering similar structure as your data set, here is an example:
df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
'home0-0': [86, 78, 65, 67, 100],
'home1-0': [76, 86, 67, 100, 0],
'home0-1': [91, 88, 75, 100, 67],
'home1-1': [75, 67, 67, 100, 100],
'away0-0': [57, 86, 71, 91, 50],
'away1-0': [73, 50, 71, 100, 100],
'away0-1': [78, 62, 40, 80, 0],
'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team']==row['team1']]
    team2 = df1[df1['team']==row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col] = np.mean([team1[col].values[0], team2[col].values[0]])
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team']==row['TXTECHIPA1']]
    team2 = data2[data2['Team']==row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col] = np.mean([int(team1[col].values[0]), int(team2[col].values[0])])
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5
The data is a time series, with many member ids associated with many categories:
data_df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
'2018-09-14 00:01:46',
'2018-09-14 00:01:56',
'2018-09-14 00:01:57',
'2018-09-14 00:01:58',
'2018-09-14 00:02:05'],
'category': [1, 1, 1, 2, 2, 2],
'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
'data': ['23', '20', '20', '11', '16', '62']})
There are about 50 categories with 30 members, each with around 1000 datapoints.
I am trying to make one plot per category.
By subsetting each category then plotting via:
fig, ax = plt.subplots(figsize=(8,6))
for i, g in category.groupby(['member']):
    g.plot(y='data', ax=ax, label=str(i))
plt.show()
This works fine for a single category; however, when I try to use a for loop to repeat this for each category, it does not work:
tests = pd.DataFrame()
for category in categories:
    tests = df.loc[df['category'] == category]
    for test in tests:
        fig, ax = plt.subplots(figsize=(8,6))
        for i, g in category.groupby(['member']):
            g.plot(y='data', ax=ax, label=str(i))
        plt.show()
This yields an "AttributeError: 'str' object has no attribute 'groupby'" error.
What I would like is a loop that spits out one graph per category, with all the members' data plotted on each graph.
Creating your dataframe
import pandas as pd
data_df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
'2018-09-14 00:01:46',
'2018-09-14 00:01:56',
'2018-09-14 00:01:57',
'2018-09-14 00:01:58',
'2018-09-14 00:02:05'],
'category': [1, 1, 1, 2, 2, 2],
'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
'data': ['23', '20', '20', '11', '16', '62']})
then [EDIT after comments]
import matplotlib.pyplot as plt
import numpy as np
subplots_n = np.unique(data_df['category']).size
subplots_x = np.round(np.sqrt(subplots_n)).astype(int)
subplots_y = np.ceil(np.sqrt(subplots_n)).astype(int)
for i, category in enumerate(data_df.groupby('category')):
    category_df = pd.DataFrame(category[1])
    x = [str(x) for x in category_df['member']]
    y = [float(x) for x in category_df['data']]
    plt.subplot(subplots_x, subplots_y, i+1)
    plt.plot(x, y)
    plt.title("Category {}".format(category_df['category'].values[0]))
plt.tight_layout()
plt.show()
which yields the figure below.
Please note that this also nicely handles bigger groups, like
data_df2 = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5],
'member': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe', 'ric', 'mat', 'pip', 'zoe', 'qui', 'quo', 'qua'],
'data': ['23', '20', '20', '11', '16', '62', '34', '27', '12', '7', '9', '13', '7']})
Far from an expert with pandas, but if you execute the following simple enough snippet
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
'2018-09-14 00:01:46',
'2018-09-14 00:01:56',
'2018-09-14 00:01:57',
'2018-09-14 00:01:58',
'2018-09-14 00:02:05'],
'category': [1, 1, 1, 2, 2, 2],
'Id': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
'data': ['23', '20', '20', '11', '16', '62']})
fig, ax = plt.subplots()
for item in df.groupby('category'):
    ax.plot([float(x) for x in item[1]['category']],
            [float(x) for x in item[1]['data'].values],
            linestyle='none', marker='D')
plt.show()
you produce this figure
But there is probably a better way.
EDIT: Based on the changes made to your question, I changed my snippet to
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': ['2018-09-14 00:00:22',
'2018-09-14 00:01:46',
'2018-09-14 00:01:56',
'2018-09-14 00:01:57',
'2018-09-14 00:01:58',
'2018-09-14 00:02:05'],
'category': [1, 1, 1, 2, 2, 2],
'Id': ['bob', 'joe', 'jim', 'sally', 'jane', 'doe'],
'data': ['23', '20', '20', '11', '16', '62']})
fig, ax = plt.subplots(nrows=np.unique(df['category']).size)
for i, item in enumerate(df.groupby('category')):
    ax[i].plot([str(x) for x in item[1]['Id']],
               [float(x) for x in item[1]['data'].values],
               linestyle='none', marker='D')
    ax[i].set_title('Category {}'.format(item[1]['category'].values[0]))
fig.tight_layout()
plt.show()
which now displays
I am using a weather API that responds with a JSON file. Here is a sample of the returned readings:
{
'data': {
'request': [{
'type': 'City',
'query': 'Karachi, Pakistan'
}],
'weather': [{
'date': '2019-03-10',
'astronomy': [{
'sunrise': '06:46 AM',
'sunset': '06:38 PM',
'moonrise': '09:04 AM',
'moonset': '09:53 PM',
'moon_phase': 'Waxing Crescent',
'moon_illumination': '24'
}],
'maxtempC': '27',
'maxtempF': '80',
'mintempC': '22',
'mintempF': '72',
'totalSnow_cm': '0.0',
'sunHour': '11.6',
'uvIndex': '7',
'hourly': [{
'time': '24',
'tempC': '27',
'tempF': '80',
'windspeedMiles': '10',
'windspeedKmph': '16',
'winddirDegree': '234',
'winddir16Point': 'SW',
'weatherCode': '116',
'weatherIconUrl': [{
'value': 'http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png'
}],
'weatherDesc': [{
'value': 'Partly cloudy'
}],
'precipMM': '0.0',
'humidity': '57',
'visibility': '10',
'pressure': '1012',
'cloudcover': '13',
'HeatIndexC': '25',
'HeatIndexF': '78',
'DewPointC': '15',
'DewPointF': '59',
'WindChillC': '24',
'WindChillF': '75',
'WindGustMiles': '12',
'WindGustKmph': '19',
'FeelsLikeC': '25',
'FeelsLikeF': '78',
'uvIndex': '0'
}]
}]
}
}
I used the following Python code in my attempt to read the data stored in the JSON file:
import simplejson as json
data_file = open("new.json", "r")
values = json.load(data_file)
But this fails with the following error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0) error
I am also wondering how I can save the result in a structured format in a CSV file using Python.
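One thing worth checking before anything else: the sample printed above uses single quotes, which makes it a Python dict literal rather than valid JSON, and json.load will reject such a file. If the file really was saved in that form (an assumption about how new.json was produced), ast.literal_eval can parse it safely:

```python
import ast

# A Python-literal string that json.loads would reject (single quotes).
text = "{'maxtempC': '27', 'mintempC': '22'}"
values = ast.literal_eval(text)
print(values['maxtempC'])  # prints 27
```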
As stated below by Rami, the simplest way to do this would be to use pandas, either with .read_json() or with pd.DataFrame.from_dict(). However, the issue in this particular case is that you have a nested dictionary/JSON. What do I mean by nested? Well, if you were to simply put this into a dataframe, you'd have this:
print (df)
request weather
0 {'type': 'City', 'query': 'Karachi, Pakistan'} {'date': '2019-03-10', 'astronomy': [{'sunrise...
Which is fine if that's what you want. However, I am assuming you'd like all the data for an instance flattened into a single row.
So you'll need to either use json_normalize to unravel it (which is possible, but you'd need to be certain the JSON file follows the same format/keys throughout, and you'd still need to pull out each of the dictionaries within the lists within the dictionaries), or use some function to flatten out the nested JSON. From there you can simply write to file:
I chose to flatten it using a function, then construct the dataframe:
import pandas as pd
import json
import re
from pandas.io.json import json_normalize
data = {'data': {'request': [{'type': 'City', 'query': 'Karachi, Pakistan'}], 'weather': [{'date': '2019-03-10', 'astronomy': [{'sunrise': '06:46 AM', 'sunset': '06:38 PM', 'moonrise': '09:04 AM', 'moonset': '09:53 PM', 'moon_phase': 'Waxing Crescent', 'moon_illumination': '24'}], 'maxtempC': '27', 'maxtempF': '80', 'mintempC': '22', 'mintempF': '72', 'totalSnow_cm': '0.0', 'sunHour': '11.6', 'uvIndex': '7', 'hourly': [{'time': '24', 'tempC': '27', 'tempF': '80', 'windspeedMiles': '10', 'windspeedKmph': '16', 'winddirDegree': '234', 'winddir16Point': 'SW', 'weatherCode': '116', 'weatherIconUrl': [{'value': 'http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png'}], 'weatherDesc': [{'value': 'Partly cloudy'}], 'precipMM': '0.0', 'humidity': '57', 'visibility': '10', 'pressure': '1012', 'cloudcover': '13', 'HeatIndexC': '25', 'HeatIndexF': '78', 'DewPointC': '15', 'DewPointF': '59', 'WindChillC': '24', 'WindChillF': '75', 'WindGustMiles': '12', 'WindGustKmph': '19', 'FeelsLikeC': '25', 'FeelsLikeF': '78', 'uvIndex': '0'}]}]}}
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
flat = flatten_json(data['data'])
results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item)[0]
    except IndexError:
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item)[0]
    column = column.replace('_', '')
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]

results.to_csv('path/filename.csv', index=False)
Output:
print (results.to_string())
type query date astronomy0sunrise astronomy0sunset astronomy0moonrise astronomy0moonset astronomy0moonphase astronomy0moonillumination maxtempC maxtempF mintempC mintempF totalSnowcm sunHour uvIndex hourly0time hourly0tempC hourly0tempF hourly0windspeedMiles hourly0windspeedKmph hourly0winddirDegree hourly0winddir16Point hourly0weatherCode hourly0weatherIconUrl0value hourly0weatherDesc0value hourly0precipMM hourly0humidity hourly0visibility hourly0pressure hourly0cloudcover hourly0HeatIndexC hourly0HeatIndexF hourly0DewPointC hourly0DewPointF hourly0WindChillC hourly0WindChillF hourly0WindGustMiles hourly0WindGustKmph hourly0FeelsLikeC hourly0FeelsLikeF hourly0uvIndex
0 City Karachi, Pakistan 2019-03-10 06:46 AM 06:38 PM 09:04 AM 09:53 PM Waxing Crescent 24 27 80 22 72 0.0 11.6 7 24 27 80 10 16 234 SW 116 http://cdn.worldweatheronline.net/images/wsymb... Partly cloudy 0.0 57 10 1012 13 25 78 15 59 24 75 12 19 25 78 0
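For completeness, the json_normalize route mentioned above can also work if you hand it the record path explicitly. A minimal sketch on a cut-down version of the payload (only a couple of keys kept; the full payload has more nested lists to handle):

```python
import pandas as pd

# A cut-down version of the weather payload from the question.
data = {'data': {'weather': [{'date': '2019-03-10',
                              'hourly': [{'tempC': '27', 'humidity': '57'}]}]}}

# Flatten the 'hourly' records and carry the parent 'date' along as metadata.
flat = pd.json_normalize(data['data']['weather'],
                         record_path='hourly', meta=['date'])
print(flat)
```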