Create a NumPy array from a pandas DataFrame inside a for loop - python

Let's say that I have the following DataFrame:
data = {"Names": ["Ray", "John", "Mole", "Smith", "Jay", "Marc", "Tom", "Rick"],
"Sports": ["Soccer", "Judo", "Tenis", "Judo", "Tenis","Soccer","Judo","Tenis"]}
I want a for loop such that, for each unique sport, I can retrieve a NumPy array containing the names of the people playing that sport. In pseudocode:
for unique_sport in sports:
    nArray = numpy array of names of people practicing sport
    # ... do something with nArray ...

Use GroupBy.apply with np.array:
import numpy as np
import pandas as pd

df = pd.DataFrame(data)
s = df.groupby('Sports')['Names'].apply(np.array)
print(s)
Sports
Judo [John, Smith, Tom]
Soccer [Ray, Marc]
Tenis [Mole, Jay, Rick]
Name: Names, dtype: object
for sport, names in s.items():
    print(names)
['John' 'Smith' 'Tom']
['Ray' 'Marc']
['Mole' 'Jay' 'Rick']

One way to go:
df = pd.DataFrame(data)
for sport in df.Sports.unique():
    list_of_names = list(df[df.Sports == sport].Names)
    nArray = np.array(list_of_names)

You can use the pandas library to get an array of names for each sport:
import numpy as np
import pandas as pd
data = {"Names": ["Ray", "John", "Mole", "Smith", "Jay", "Marc", "Tom", "Rick"],
"Sports": ["Soccer", "Judo", "Tenis", "Judo", "Tenis","Soccer","Judo","Tenis"]}
df = pd.DataFrame(data)
unique_sports = df['Sports'].unique()
for sport in unique_sports:
    uniqueNames = np.array(df[df['Sports'] == sport]['Names'])
    print(uniqueNames)
Result:
['Ray' 'Marc']
['John' 'Smith' 'Tom']
['Mole' 'Jay' 'Rick']
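The pseudocode from the question also maps directly onto iterating over the groupby object itself; a minimal sketch of that variant (not from the answers above):

```python
import numpy as np
import pandas as pd

data = {"Names": ["Ray", "John", "Mole", "Smith", "Jay", "Marc", "Tom", "Rick"],
        "Sports": ["Soccer", "Judo", "Tenis", "Judo", "Tenis", "Soccer", "Judo", "Tenis"]}
df = pd.DataFrame(data)

# Each iteration yields one sport and the sub-frame of rows for that sport.
for sport, group in df.groupby("Sports"):
    nArray = group["Names"].to_numpy()
    print(sport, nArray)
```

This avoids filtering the whole frame once per sport, since groupby partitions the rows in a single pass.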

Related

Load nested JSON with incremental key/timestamp into Pandas DataFrame

I am trying to read a JSON dataset (see a part of it below). I want to flatten it into a pandas DataFrame so that I can access all columns, in particular "A" and "B", with some data as columns for further processing.
import pandas as pd
datajson= {
"10001": {
"extra": {"user": "Tom"},
"data":{"A":5, "B":10}
},
"10002":{
"extra": {"user": "Ben"},
"data":{"A":7, "B":20}
},
"10003":{
"extra": {"user": "Ben"},
"data":{"A":6, "B":15}
}
}
df = pd.DataFrame.from_dict(datajson, orient='index')
# pd.read_json expects a JSON string rather than a dict, e.g.:
# df = pd.read_json(json.dumps(datajson), orient='index')
which results in a DataFrame whose "extra" and "data" columns still contain the nested dicts.
I am assuming there is a simple way, without looping/appending or writing a complicated and slow decoder, for example using pandas' json_normalize().
I don't think you will be able to do that without looping through the json. You can do that relatively efficiently though if you make use of a list comprehension:
def parse_inner_dictionary(data):
    return pd.concat([pd.DataFrame(i, index=[0]) for i in data.values()], axis=1)

df = pd.concat([parse_inner_dictionary(v) for v in datajson.values()])
df.index = datajson.keys()
print(df)
user A B
10001 Tom 5 10
10002 Ben 7 20
10003 Ben 6 15
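For the record, the json_normalize() route the question hints at does work here too; a minimal sketch, assuming the top-level keys should become the index (the column renaming is my choice, not part of the question):

```python
import pandas as pd

datajson = {
    "10001": {"extra": {"user": "Tom"}, "data": {"A": 5, "B": 10}},
    "10002": {"extra": {"user": "Ben"}, "data": {"A": 7, "B": 20}},
    "10003": {"extra": {"user": "Ben"}, "data": {"A": 6, "B": 15}},
}

# json_normalize flattens nested dicts into dotted column names
# ("extra.user", "data.A", "data.B"), with no explicit looping in user code.
df = pd.json_normalize(list(datajson.values()))
df.index = list(datajson.keys())
df.columns = [c.split(".")[-1] for c in df.columns]  # "extra.user" -> "user"
print(df)
```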

Split column containing list of dicts and match dictionary values to new column based on key

I am in need of some help in python.
I have a column in my dataset that contains a list of dictionaries. An example of the contents of such a cell are as follows:
lyst = [
{"region": "Baden-Wuerttemberg", "percentage":0.176553},
{"region": "Bayern", "percentage":0.142457},
{"region": "Bremen", "percentage":0.008874},
{"region": "Hamburg", "percentage":0.01915},
{"region": "Hessen", "percentage":0.09612},
{"region": "Niedersachsen", "percentage":0.094816},
{"region": "Nordrhein-Westfalen", "percentage":0.244745},
]
There are up to 17 dictionaries per list. Each list describes the percentage of ad targeting for each German state involved in an advertisement. Unfortunately, the order in which the states are mentioned differs from cell to cell, and some cells do not include states that others do.
(Screenshots in the original post show what the cells look like in the original dataset.)
I'd like to split the column based on the name of the region and then put the percentage value in the new cell. I have no idea where to start; any help would be greatly appreciated.
We can generate some simple test data as follows:
import pandas as pd
cell1 = [
{"region": "Baden-Wuerttemberg", "percentage":0.176553},
{"region": "Bayern", "percentage":0.142457},
{"region": "Bremen", "percentage":0.008874},
{"region": "Hamburg", "percentage":0.01915},
{"region": "Hessen", "percentage":0.09612},
{"region": "Niedersachsen", "percentage":0.094816},
{"region": "Nordrhein-Westfalen", "percentage":0.244745},
]
cell2 = [
{'region': 'Baden-Wuerttemberg', 'percentage': 0.5882765},
{'region': 'Bremen', 'percentage': 0.504437},
{'region': 'Hessen', 'percentage': 0.54806}
]
old_data = pd.DataFrame({
    "Advert ID": ["Ad-" + str(id) for id in range(1, 3)],
    "Efficacy": [cell1, cell2],
})
print(old_data)
Now that we have some test data, we extract the names of the German states that occur.
import itertools as itts

column = old_data["Efficacy"]
lysts = [t[1] for t in column.items()]
regions = set(itts.chain.from_iterable([[d["region"] for d in lyst] for lyst in lysts]))
print(regions)
regions is a set of strings; something like the following:
{'Nordrhein-Westfalen', 'Baden-Wuerttemberg', 'Bremen', 'Bayern', 'Niedersachsen', 'Hessen', 'Hamburg'}
Let us make a new pandas.DataFrame with one column for each of the German states.
import pandas as pd
new_data = pd.DataFrame(
columns = [old_data.columns[0]] + list(regions)
)
Now, you can populate the new data frame with data from the old data-frame.
EDIT:
To populate the new data frame with data from your old data frame, you can write something like this:
new_data = pd.DataFrame(
columns = [old_data.columns[0]] + list(regions)
)
adverts = [t[1] for t in old_data["Advert ID"].items()]
# print("adverts == ", repr(str(adverts))[1:-1])
for advert in adverts:
    old_row = old_data.loc[old_data["Advert ID"] == advert]
    list_of_dicts = old_row["Efficacy"].to_list()[0]
    rowi = 1 + len(new_data.index)
    new_data.at[rowi, "Advert ID"] = advert
    for dyct in list_of_dicts:
        region = dyct["region"]
        percentage = dyct["percentage"]
        new_data.at[rowi, region] = percentage
Let us display what we have:
def pretty_print(df: pd.DataFrame):
    import io
    colw = 30  # each column is padded to 30 characters
    strm = io.StringIO()
    d = df.to_dict()
    for key in d.keys():
        print(key.ljust(colw), end="", file=strm)
        print(*[repr(str(v))[1:-1].ljust(colw) for v in d[key].values()], file=strm)
    # return strm.getvalue()
    print(strm.getvalue())

pretty_print(new_data)
We get something like:
Advert ID Ad-1 Ad-2
Bremen 0.008874 0.504437
Baden-Wuerttemberg 0.176553 0.5882765
Bayern 0.142457 nan
Hessen 0.09612 0.54806
Hamburg 0.01915 nan
Nordrhein-Westfalen 0.244745 nan
Niedersachsen 0.094816 nan
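A much shorter route to the same wide table, sketched here as an alternative not taken from the answer above: apply a dict comprehension per cell and let pandas align the region names into columns (the two-row test data here is a trimmed-down version of the original):

```python
import pandas as pd

cell1 = [{"region": "Bremen", "percentage": 0.008874},
         {"region": "Hamburg", "percentage": 0.01915}]
cell2 = [{"region": "Bremen", "percentage": 0.504437}]

old_data = pd.DataFrame({"Advert ID": ["Ad-1", "Ad-2"],
                         "Efficacy": [cell1, cell2]})

# One pd.Series per cell; pandas aligns the region names into columns
# and fills regions missing from a cell with NaN.
wide = old_data["Efficacy"].apply(
    lambda lst: pd.Series({d["region"]: d["percentage"] for d in lst}))
new_data = pd.concat([old_data[["Advert ID"]], wide], axis=1)
print(new_data)
```

This also sidesteps collecting the set of states up front, since alignment discovers the columns automatically.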

Remove duplicate values while group by in pandas data frame

Given the input data frame and required output (both shown as images in the original post):
I am able to achieve this using the groupby function, as df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text")), but duplicates are still present in the output tuples.
You can use SeriesGroupBy.unique() to get the unique values of entity_text before applying tuple to the list, as follows:
(df.groupby("entity_label", sort=False)["entity_text"]
   .unique()
   .apply(tuple)
   .reset_index(name="entity_text")
)
Result:
entity_label entity_text
0 job_title (Full Stack Developer, Senior Data Scientist, Python Developer)
1 country (India, Malaysia, Australia)
Try this:
import pandas as pd
df = pd.DataFrame({'entity_label':["job_title", "job_title","job_title","job_title", "country", "country", "country", "country", "country"],
'entity_text':["full stack developer", "senior data scientiest","python developer","python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],})
df.drop_duplicates(inplace=True)
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
df.drop_duplicates().reset_index().drop(['index'], axis='columns')
output:
entity_label entity_text
0 job_title full stack developer,senior data scientiest,py...
1 country Inida,Malaysia,India,Australia
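An order-preserving variant, sketched here as an addition to the answers above: aggregate with dict.fromkeys, which de-duplicates while keeping first-seen order (the data below is a trimmed-down example):

```python
import pandas as pd

df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "country", "country", "country"],
    "entity_text": ["python developer", "python developer", "India", "Malaysia", "India"],
})

# dict.fromkeys removes duplicates while preserving insertion order,
# so each group's tuple lists values in first-appearance order.
out = (df.groupby("entity_label", sort=False)["entity_text"]
         .agg(lambda s: tuple(dict.fromkeys(s)))
         .reset_index())
print(out)
```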

Replace DataFrame column with nested dictionary value

I'm trying to replace the 'starters' column of this DataFrame
starters
roster_id
Bob 3086
Bob 1234
Cam 6130
... ...
with the player names from a large nested dict like this. The values in my 'starters' column are the keys.
{
"3086": {
"team": "NE",
"player_id":"3086",
"full_name": "tombrady",
},
"1234": {
"team": "SEA",
"player_id":"1234",
"full_name": "RussellWilson",
},
"6130": {
"team": "BUF",
"player_id":"6130",
"full_name": "DevinSingletary",
},
...
}
I tried using DataFrame.replace(dict) and Series.map(dict), but that gives me back all the player info instead of just the name.
Is there a way to do this with a nested dict? Thanks.
Let df be the DataFrame and d the dictionary; then you can use pandas' apply on axis 1 to change the column:
df.apply(lambda x: d[str(x.starters)]['full_name'], axis=1)
I am not sure, if I understand your question correctly. Have you tried using dict['full_name'] instead of simply dict?
Try pd.concat with series.map:
>>> pd.concat([
df,
pd.DataFrame.from_records(
df.astype(str)
.starters
.map(dct)
.values
).set_index(df.index)
], axis=1)
starters team player_id full_name
roster_id
Bob 3086 NE 3086 tombrady
Bob 1234 SEA 1234 RussellWilson
Cam 6130 BUF 6130 DevinSingletary
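Since only one value from each inner dict is needed, a plain Series.map over the stringified keys also works; a minimal sketch with a cut-down version of the data above:

```python
import pandas as pd

players = {
    "3086": {"team": "NE", "player_id": "3086", "full_name": "tombrady"},
    "1234": {"team": "SEA", "player_id": "1234", "full_name": "RussellWilson"},
}
df = pd.DataFrame({"starters": [3086, 1234]})

# Cast to str so the values match the dict keys, then map each key
# to just the "full_name" field of its inner dict.
df["starters"] = df["starters"].astype(str).map(lambda k: players[k]["full_name"])
print(df)
```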

Making a string out of pandas DataFrame

I have pandas DataFrame which looks like this:
Name Number Description
car 5 red
And I need to make a string out of it which looks like this:
"""Name: car
Number: 5
Description: red"""
I'm a beginner and I really don't get how to do it. I'll probably need to apply this to some similar DataFrames later.
You can use iterrows to iterate over your DataFrame rows; for each row you can then get the columns and print the result the way you want. For example:
import pandas as pd
dtf = pd.DataFrame({
"Name": ["car", "other"],
"Number": [5, 6],
"Description": ["red", "green"]
})
def stringify_dataframe(dtf):
    text = ""
    for i, row in dtf.iterrows():
        for col in dtf.columns.values:
            text += f"{col}: {row[col]}\n"
        text += "\n"
    return text

s = stringify_dataframe(dtf)
Now s contains the following:
>>> print(s)
Name: car
Number: 5
Description: red
Name: other
Number: 6
Description: green
Iterating over a DataFrame is faster when using apply:
import pandas as pd
df = pd.DataFrame({
"Name": ["car", "other"],
"Number": [5, 6],
"Description": ["red", "green"]
})
s = '\n'.join(
    df.apply(
        lambda row: '\n'.join(f'{head}: {val}' for head, val in row.items()),
        axis=1))
Of course, for this small data set a for loop is faster, but on my machine a data set with 10 rows was already processed faster.
Another approach,
import pandas as pd
dtf = pd.DataFrame({
"Name": ["car", "other"],
"Number": [5, 6],
"Description": ["red", "green"]
})
for row_index in range(len(dtf)):
    for col in dtf.columns:
        print(f"{col}: {dtf.loc[row_index, col]}")
Name: car
Number: 5
Description: red
Name: other
Number: 6
Description: green
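Yet another option, sketched here as an addition not taken from the answers above: to_dict("records") yields one plain dict per row, which turns the string building into a single expression.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["car", "other"],
    "Number": [5, 6],
    "Description": ["red", "green"],
})

# One dict per row; join "col: value" lines within a row,
# and separate rows with a blank line.
s = "\n\n".join(
    "\n".join(f"{col}: {val}" for col, val in rec.items())
    for rec in df.to_dict("records"))
print(s)
```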
