I have a dataset:
Year Name Value
1 A 10
2 B 20
3 A 25
3 B 10
I want to find how the value for each name has changed over the years. Ideally the result should look like this:
Name Growth/Year
A (25-10)/(3-1)
B (10-20)/(3-2)
I could build a list of unique names first and then loop through the dataset to find the value change, but is there an easier way to do it?
Try using pandas groupby:
df = pd.DataFrame({'year':[1,2,3,3], 'name':['A','B','A','B'], 'value':[10,20,25,10]})
df.groupby('name').apply(lambda x: (x['value'].iloc[1]-x['value'].iloc[0])/(x['year'].iloc[1]-x['year'].iloc[0]))
>>>
name
A 7.5
B -10.0
dtype: float64
Or to have more flexibility, you could define an aggregation function:
def value_change(x):
    yr1 = min(x['year'])
    yr2 = max(x['year'])
    # Get values corresponding to min and max years in case
    # min and max year rows aren't contiguous
    value1 = x[x['year'] == yr1]['value'].iloc[0]
    value2 = x[x['year'] == yr2]['value'].iloc[0]
    return (value2 - value1) / (yr2 - yr1)
df.groupby('name').apply(value_change)
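On a reasonably recent pandas you can also avoid the custom function entirely. A minimal sketch (reusing the df defined above) sorts by year and takes the first and last value per name:
g = df.sort_values('year').groupby('name')
growth = (g['value'].last() - g['value'].first()) / (g['year'].last() - g['year'].first())
print(growth)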
So first you want the rows which correspond to the first and last year for each name:
df = pd.DataFrame([[1,"A", 10], [2, "B", 20], [3, "A", 25], [3, "B", 10]], columns=["Year", "Name", "Value"])
first_year = df.groupby("Name").Year.idxmin().sort_index()
last_year = df.groupby("Name").Year.idxmax().sort_index()
I added the sort_index just to be sure that the order lines up but it might be a good idea to test this to make sure that nothing went wrong. Here's how I would test it:
assert (first_year.index == last_year.index).all()
Next, you can use these to extract the values you want and compute the change:
change_value = (df.loc[last_year.values, "Value"].values - df.loc[first_year.values, "Value"].values)
change_time = (df.loc[last_year.values, "Year"].values - df.loc[first_year.values, "Year"].values)
change_over_time = change_value / change_time
Then, you can convert this into a pandas Series if you'd like:
pd.Series(change_over_time, index=first_year.index)
There is probably a tidier way to do this all within the groupby (see the sketch below), but I think this way is easier to understand.
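For reference, one such tidier variant (just a sketch, reusing the df with Year/Name/Value columns from above) does everything inside a single groupby/apply:
def growth(g):
    g = g.sort_values("Year")
    return (g["Value"].iloc[-1] - g["Value"].iloc[0]) / (g["Year"].iloc[-1] - g["Year"].iloc[0])

growth_per_name = df.groupby("Name").apply(growth)
print(growth_per_name)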
I have a dataframe such as below:
example_df = pd.DataFrame({"group_id": ["12356", "12356", "12359", "12356", "12359"], "date": ["2021-12-03", "2021-12-05", "2021-05-06", "2021-11-04", "2021-06-05"]})
I need to find the differences between the dates in the date column for each group_id. For example, for group_id=12356 there are 3 dates available:
["2021-12-03", "2021-12-05", "2021-11-04"]
I need the differences between these dates in days. There are 3 pairwise combinations of these dates. I wrote some code, but it does not work the way I want and is very slow because I am using iterrows. Is there a shorter and easier way to achieve this?
My code:
%%time
date_diff_dict = {}
for index, row in example_df.iterrows():
    group_id = row.group_id
    group_id_df = example_df[example_df.group_id == group_id]
    date_diff_list = []
    for idx, rw in group_id_df.iterrows():
        if (row.order_id != rw.order_id) & (row.order_id >= rw.order_id):
            date_diff = np.abs((row.date - rw.date).days)
            print(row.date, rw.date)
            print(date_diff)
            date_diff_list.append(date_diff)
    print(date_diff_list)
    date_diff_dict[str(group_id)] = date_diff_list
This code gives a partially correct answer but misses a day.
Expected output is :
{'12356': [2, 29, 31], '12359': [30]}
Here's one way:
(i) convert "date" from string literals to datetime objects
(ii) group by "group_id"
(iii) for each group, use itertools.combinations to find pairs of dates
(iv) find the absolute difference in days between the pairs of dates
from itertools import combinations
example_df['date'] = pd.to_datetime(example_df['date'])
out = example_df.groupby('group_id')['date'].apply(lambda date: [abs((y-x).days) for x,y in combinations(date, 2)]).to_dict()
Output:
{'12356': [2, 29, 31], '12359': [30]}
I have a DataFrame
index  Street  House  Building
1      ABC     20     a
2      ABC     20     b
3      ABC     21     NaN
4      BCD     2      1
I need to create a multiple selection from this DataFrame:
If Street == str_filter;
If House == house_filter;
If Building == build_filter AND If Building is not NULL;
I've already tried df[(df['Street'] == str_filter) & (df['House'] == house_filter) & ((df['Building'] == build_filter) & (pd.notnull(df['Building'])))]
But it doesn't lead to the result I want to see. I have to check whether the Building value is not NaN and, if so, select the row with that particular Building number. However, I also want to select a row if it has a NaN value for Building but meets the other criteria.
Another idea was to create lists for the set of filter values and the set of those values meeting the pd.notnull criteria:
filter_values = [str_filter, house_filter, build_filter]
notnull_values = [pd.notnull(entry) for entry in filter_values]
This one doesn't meet the performance criteria, because I have an extremely large DataFrame, and creating additional lists with additional filtering will hurt performance. A possible solution may lie in df.loc, but I don't know how to implement it.
To summarise, the problem is the following: How to create multiple selection in pandas with conditions for NaN values?
UPD: It seems that the expression I need is something like df[... & (df['Building'] == 'a' if pd.notnull(df['Building']))], by analogy with the lambda/apply trick (this is pseudocode, not valid pandas).
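A vectorized version of that idea (a sketch, assuming str_filter, house_filter and build_filter are single scalar values) would combine the NaN check with | instead of an if:
mask = (
    (df['Street'] == str_filter)
    & (df['House'] == house_filter)
    & (df['Building'].isna() | (df['Building'] == build_filter))
)
result = df[mask]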
Since you said in a comment that you also had issues with duplicates, I've added an option to remove these.
I have recreated your dataset with this, and added a duplicate.
df = pd.DataFrame({
    "Street": ["ABC", "ABC", "ABC", "BCD", "ABC"],
    "House": [20, 20, 21, 2, 21],
    "Building": ['a', 'b', np.nan, 1, np.nan]
})
To remove duplicates from your dataset you can simply run this block:
df = df.drop_duplicates()
I am assuming that your filter might be a whitelist of strings or numbers that you want to include in your final dataset. Therefore I took the liberty of defining a filter as this:
street_filter = ["ABC", "BCD"]
To clean the dataset you could then use this method:
def get_mask(df, street_filter, house_filter, building_filter):
    not_na_mask = df['Building'].notna()
    # hopefully smaller df to query
    df = df[not_na_mask]
    street_mask = df['Street'].apply(lambda x: x in street_filter)
    df = df[street_mask]
    house_mask = df['House'].apply(lambda x: x in house_filter)
    df = df[house_mask]
    building_mask = df['Building'].apply(lambda x: x in building_filter)
    return df[building_mask]
Example
street_filter = ["ABC", "BCD"]
house_filter = [20, 21]
building_filter = ["a"]
get_mask(df, street_filter, house_filter, building_filter)
Street House Building
0 ABC 20 a
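As a side note, the same masks can be built without apply by using Series.isin, which is vectorized and usually faster on large frames (a sketch using the same filter lists):
mask = (
    df['Building'].notna()
    & df['Street'].isin(street_filter)
    & df['House'].isin(house_filter)
    & df['Building'].isin(building_filter)
)
df[mask]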
A rather simple way to do it would be this:
if df[df['Building'].isnull()].empty:
    df[df['Street'] == str_filter][df['House'] == house_filter][df['Building'] == build_filter]
else:
    df[df['Street'] == str_filter][df['House'] == house_filter][df['Building'].isnull()]
I'm not sure if this would satisfy your performance requirements?
I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly, and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them, with each dataframe labelled by its file name.
I already have the file name.
For example, let say I have 2 dataframes with their own file names, I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough, this is what I am expecting as an output:
[expected output image]
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
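If you would rather keep the file name as an index level instead of a column, concat also accepts a mapping; a sketch reusing A and B from above:
C = pd.concat({a_filename: A, b_filename: B})  # dict keys become the outer index level
C.index = C.index.set_names('filename', level=0)
print(C)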
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and reasoning below:
Specified which columns your master DataFrame will have
Instead of using some function that it seems like you were trying to define, you can simply create a new column called "file_name" that holds the filepath used to build the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I added a comment where you can make edits if you want to use string methods to clean up the filenames.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation if you're familiar with SQL or with set theory), you can use the append method.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min','file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
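One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On newer pandas, a sketch of the same loop (still assuming the parseGFfile helper from the question) collects the per-file frames in a list and concatenates once at the end:
frames = []
for file in t0_folder:
    raw_data = parseGFfile(file)  # helper from the question
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file
    frames.append(file_data)
t0_data = pd.concat(frames, ignore_index=True)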
For a project, I want to create a script that allows the user to enter values (like a value in centimetres) multiple times. I had a While-loop in mind for this.
The values need to be stored in a dataframe, which will be used to generate a graph of the values.
Also, there is no maximum number of entries that the user can enter, so the names of the variables that hold the values have to be generated with each entry (such as M1, M2, M3…Mn). However, the dataframe will only consist of one row (only for the specific case that the user is entering values for).
So, my question boils down to this:
How do I create a dataframe (with pandas) where the script generates its own column name for each measurement, like M1, M2, M3, …Mn, so that all the values are stored?
I can't access my code right now, but I have created a While-loop that allows the user to enter values; I'm stuck on the dataframe and columns part.
Any help would be greatly appreciated!
I agree with @mischi; without additional context, pandas seems like overkill, but here is an alternate method to create what you describe...
This code proposes a method to collect the values using a while loop and input() (your while loop is probably similar).
colnames = []
inputs = []
counter = 0
while True:
    value = input('Add a value: ')
    if value == 'q':  # provides a way to leave the loop
        break
    else:
        key = 'M' + str(counter)
        counter += 1
        colnames.append(key)
        inputs.append(value)
from pandas import DataFrame
df = DataFrame(inputs, colnames) # this creates a DataFrame with
# a single column and an index
# using the colnames
df = df.T # This transposes the DataFrame to
# so the indexes become the colnames
df.index = ['values'] # Sets the name of your row
print(df)
The output of this script looks like this...
Add a value: 1
Add a value: 2
Add a value: 3
Add a value: 4
Add a value: q
M0 M1 M2 M3
values 1 2 3 4
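An equivalent, slightly more direct construction (a sketch reusing the inputs list collected above) builds the single row as a dict and skips the transpose:
import pandas as pd

row = {'M' + str(i): v for i, v in enumerate(inputs)}
df = pd.DataFrame(row, index=['values'])
print(df)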
pandas seems a bit of an overkill here, but to answer your question:
Assuming you collect numerical values from your users and store them in a list:
import numpy as np
import pandas as pd
values = np.random.random_integers(0, 10, 10)  # deprecated in newer NumPy; np.random.randint(0, 11, 10) is the modern equivalent
print(values)
array([1, 5, 0, 1, 1, 1, 4, 1, 9, 6])
columns = {}
column_base_name = 'Column'
for i, value in enumerate(values):
    columns['{:s}{:d}'.format(column_base_name, i)] = value
print(columns)
{'Column0': 1,
'Column1': 5,
'Column2': 0,
'Column3': 1,
'Column4': 1,
'Column5': 1,
'Column6': 4,
'Column7': 1,
'Column8': 9,
'Column9': 6}
df = pd.DataFrame(data=columns, index=[0])
print(df)
Column0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 \
0 1 5 0 1 1 1 4 1
Column8 Column9
0 9 6
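A more compact construction (a sketch using the same values array) generates the column names and the single-row frame in one step:
df = pd.DataFrame([values], columns=['Column{}'.format(i) for i in range(len(values))])
print(df)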
I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.
Let's say I want to iterate over all values in the column head 'ID.' If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.
In Pandas, I'm trying something like this
df = read_csv("test.csv")
for i in df['ID']:
    if i == 103:
        ...
Not sure where to go from here. Any ideas?
One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.
Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.
import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"
As mentioned in the comments, you can also do the assignment to both columns in one shot:
df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
Note that you'll need pandas version 0.11 or newer to use loc for overwrite assignment operations. For older versions such as 0.8, chained assignment (despite what its critics may say) was the way to do it, which is why it's still useful to know about even though it should be avoided in more modern versions of pandas.
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
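A third vectorized option worth knowing about (a sketch; numpy.where keeps the existing value wherever the condition is false):
import numpy as np

df['FirstName'] = np.where(df.ID == 103, 'Matt', df['FirstName'])
df['LastName'] = np.where(df.ID == 103, 'Jones', df['LastName'])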
You can use map; it can map values from a dictionary or even a custom function.
Suppose this is your df:
ID First_Name Last_Name
0 103 a b
1 104 c d
Create the dicts:
fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}
And map:
df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)
The result will be:
ID First_Name Last_Name
0 103 Matt Jones
1 104 Mr X
Or use a custom function:
names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
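One caveat: Series.map returns NaN for IDs that are missing from the dictionary (and the lambda version raises a KeyError for them). A sketch that keeps the existing names for any unmapped IDs chains fillna:
df['First_Name'] = df['ID'].map(fnames).fillna(df['First_Name'])
df['Last_Name'] = df['ID'].map(lnames).fillna(df['Last_Name'])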
The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
['cat', 'ragdoll', 1]],
columns=['animal', 'type', 'age'])
In[1]:
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we add a new description column as a concatenation of other columns, using the + operator, which is overridden for Series to work element-wise. Fancy string formatting (f-strings, etc.) won't work here, since those apply to scalar values rather than to whole Series:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
+ df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get "1 years" for the cat (instead of "1 year"), which we will fix below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'year' if age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
    original_animal = r.animal  # keep the original value before overwriting it
    r.animal = 'wild ' + r.type
    r.type = original_animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r
df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above, the transform_row(r) function takes a Series object representing a given row (indicated by axis=1; the default axis=0 would pass a Series for each column instead). This simplifies processing, since you can access the actual 'primitive' values in the row by column name and see the other cells in the given row.
This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.
In this way it's possible to avoid key errors.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
... def __missing__(self, key):
... return ''
...
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
df['FirstName']=df['ID'].apply(lambda x: 'Matt' if x==103 else '')
df['LastName']=df['ID'].apply(lambda x: 'Jones' if x==103 else '')
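Note that this overwrites FirstName and LastName with an empty string for every row whose ID is not 103. A sketch that leaves the other rows untouched uses Series.mask instead:
df['FirstName'] = df['FirstName'].mask(df['ID'] == 103, 'Matt')
df['LastName'] = df['LastName'].mask(df['ID'] == 103, 'Jones')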
In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.
df = pd.DataFrame({'col_a':[0,0], 'col_b':[1,2]})
col_a col_b
0 0 1
1 0 2
def func(row):
    if row.col_a == 0 and row.col_b <= 1:
        row.col_a = -1
        row.col_b = -1
    return row
df.apply(func, axis=1)
col_a col_b
0 -1 -1 # Modified row
1 0 2
Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns.
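Keep in mind that .apply() returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the changes:
df = df.apply(func, axis=1)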
I found it much easier to debug by printing out where each row meets the condition:
for n in df.columns:
    if (df[n] == 103).any():
        print(n)
        print(df[df[n] == 103].index)