I have this DataFrame, built so that for id equal to 2 there are two different values in the num and my_date columns:
import pandas as pd
from datetime import datetime

a = pd.DataFrame({'id': [1, 2, 3, 2],
                  'my_date': [datetime(2017, 1, i) for i in range(1, 4)] + [datetime(2017, 1, 1)],
                  'num': [2, 3, 1, 4]
                  })
For convenience, this is the DataFrame printed:
   id    my_date  num
0   1 2017-01-01    2
1   2 2017-01-02    3
2   3 2017-01-03    1
3   2 2017-01-01    4
If I want to count the number of unique values for each id, I'd do
grouped_a = a.groupby('id').agg({'my_date': pd.Series.nunique,
                                 'num': pd.Series.nunique}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
which gives a weird result for the my_date column. It looks like counting unique values on a datetime column (which pandas stores as datetime64[ns]) is not working?
It is a bug; see GitHub issue 14423. But you can use SeriesGroupBy.nunique, which works nicely:
grouped_a = a.groupby('id').agg({'my_date': 'nunique',
                                 'num': 'nunique'}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
If the DataFrame has only these three columns, you can use:
grouped_a = a.groupby('id').agg(['nunique']).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)
   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
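On pandas 0.25 or newer (an assumption about your version), named aggregation sets the output column names directly and avoids the manual rename; a minimal sketch of the same computation:

# name each output column right in the agg call
grouped_a = (a.groupby('id')
              .agg(num_unique_my_date=('my_date', 'nunique'),
                   num_unique_num=('num', 'nunique'))
              .reset_index())
print(grouped_a)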
I have the following DataFrame. Now I want to insert an empty row after every row where the column "Zweck" equals 7; so, for example, the third row should be an empty row.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
# remember the original column names, since np.insert returns a bare array
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
# positions of the rows after which an empty row should appear
ind = df[df['f'] == 7].index
# insert a sentinel row of 33s after each matching row
# (np.insert inserts before the given position, hence ind + 1)
df = pd.DataFrame(np.insert(df.values, ind + 1, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
# blank out the sentinel rows
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
   a  b  f
0  1  1  1
1  2  2  7
2
3  3  3  3
4  4  4  4
5  5  5  7
6
Here the DataFrame is rebuilt in one shot rather than appended to row by row, since repeated appends are resource intensive. Because np.insert works on the underlying NumPy array and cannot substitute string values into a numeric array, a sentinel value of 33 is inserted instead. df.rename restores the original column names, and finally the rows where df['a'] == 33 are set to empty values.
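A sentinel-free alternative, as a sketch assuming the same sample frame: give the original rows even index positions, create empty rows at the odd slot after each match, and interleave with sort_index.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})

# give the original rows even positions: 0, 2, 4, ...
df.index = range(0, 2 * len(df), 2)

# one empty row at the odd slot right after each f == 7 row
empty = pd.DataFrame('', index=df.loc[df['f'] == 7].index + 1,
                     columns=df.columns)

# interleave by index order and restore a clean 0..n index
out = pd.concat([df, empty]).sort_index().reset_index(drop=True)
print(out)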
Reproducible DataFrame:
import pandas as pd
data = {'refid': ['1.2.34', '1.2.35', '1.3.66',
                  '1.6.99', '1.9.00', '1.87.66',
                  '1.98.00', '1.100.1', '1.101.3']}
my_index = pd.MultiIndex.from_arrays([["A"]*6 + ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)
I want a new column that combines ID-B with refid truncated at the second delimiter. For example, for ID-B 1 and refid 1.2.34, the secondary refid should be 1.2 and the unique ID should be 1_1.2.
You can use Index.get_level_values with str.extract and concatenate the values after converting them to strings:
df['new'] = (df.index.get_level_values('ID-B').astype(str) + '_'
             + df['refid'].str.extract(r'(\d+\.\d+)', expand=False)
             )
Output:
refid new
ID-A ID-B
A 1 1.2.34 1_1.2
1 1.2.35 1_1.2
1 1.3.66 1_1.3
2 1.6.99 2_1.6
2 1.9.00 2_1.9
2 1.87.66 2_1.87
B 1 1.98.00 1_1.98
1 1.100.1 1_1.100
1 1.101.3 1_1.101
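If the refid format is fixed, a split-based variant avoids the regex; a sketch, assuming every refid has at least two dot-separated parts:

# keep only the first two dot-separated parts of refid
sec = df['refid'].str.split('.').str[:2].str.join('.')
df['new'] = df.index.get_level_values('ID-B').astype(str) + '_' + sec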
I am working on extracting a JSON array column from a DataFrame in Python using the pandas library, where I have data like this:
>df
id partnerid payments
5263 org1244 [{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]
5264 org1245 [{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]
5265 org1246 [{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]
I want to extract the JSON data inside the list and add it as columns in the same DataFrame, like this:
>mod_df
id partnerid sNo amount
5263 org1244 1 1000
5263 org1244 2 500
5264 org1245 1 2000
5264 org1245 2 600
5265 org1246 1 3000
5265 org1246 2 700
I have tried this approach:
import pandas as pd
import json as j

df = pd.read_parquet('sample.parquet')
js_loads = df['payments'].apply(j.loads)
js_list = list(js_loads)
j_data = j.dumps(js_list)
df = df.join(pd.read_json(j_data))
df = df.drop(columns=['payments'])
But this only works if the column holds a single JSON object, not a list of JSON objects. Can someone explain how I can achieve my desired output?
Convert it to a list with ast.literal_eval and use explode() to transform each element into a row, replicating the other columns. Then use .apply(pd.Series) to convert each dict-like element into a Series. Finally, concatenate back to the original DataFrame with pd.concat().
Example:
import ast
import pandas as pd

# sample data
d = {'col1': [0, 1, 2],
     'payments': ['[{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]',
                  '[{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]',
                  '[{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]']}
df = pd.DataFrame(data=d, index=[0, 1, 2])
df['payments'] = df['payments'].apply(ast.literal_eval)
df = df.explode('payments')
out = pd.concat([df.drop(['payments'], axis=1),
                 df['payments'].apply(pd.Series)], axis=1).reset_index(drop=True)
Output:
col1 sNo amount
0 0 1 1000
1 0 2 500
2 1 1 2000
3 1 2 600
4 2 1 3000
5 2 2 700
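Since the strings are valid JSON, a variant using json.loads and pd.json_normalize (available in pandas 1.0+) should also work; a sketch with the same sample data:

import json
import pandas as pd

d = {'col1': [0, 1, 2],
     'payments': ['[{"sNo": 1, "amount":"1000"}, {"sNo": 2, "amount":"500"}]',
                  '[{"sNo": 1, "amount":"2000"}, {"sNo": 2, "amount":"600"}]',
                  '[{"sNo": 1, "amount":"3000"}, {"sNo": 2, "amount":"700"}]']}
df = pd.DataFrame(d)

# the strings are valid JSON, so json.loads parses them directly
df['payments'] = df['payments'].apply(json.loads)
df = df.explode('payments').reset_index(drop=True)

# json_normalize turns the list of dicts into sNo/amount columns
out = pd.concat([df.drop(columns='payments'),
                 pd.json_normalize(df['payments'].tolist())], axis=1)
print(out)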
I have a subset of a DataFrame here:
import pandas as pd

data = {'Name': ['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6'],
        'Time': [1, 2, 3, 4, 5, 6],
        'Week': [1, 2, 3, 2, 3, 2]}
dfx = pd.DataFrame(data)
I need to sum up all the times for each week, so Week 1's time is 1, Week 2's is 2+4+6, and Week 3's is 3+5. I also need it to look through the 'Week' column and find all the distinct weeks; for this example there are 3, but for another DataFrame there could be 2 or 4. The end result: look through a column in a DataFrame, find the unique values (1, 2, 3, ..., n), group the rows by each of those values, and sum the time for each. I have tried a handful of ways but nothing works quite how I would like. I appreciate any help or ideas.
Expected Output:
Sum
Week 1: 1 1
Week 2: 2 4 6 12
Week 3: 3 5 8
The output can be either individual DataFrames or one DataFrame that has a row per week with all the numbers and the sum at the end.
import pandas as pd

data = {'Name': ['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6'],
        'Time': [1, 2, 3, 4, 5, 6],
        'Week': [1, 2, 3, 2, 3, 2]}
dfx = pd.DataFrame(data)
dfx = dfx.groupby('Week')['Time'].sum()
print(dfx)
Output:
Week
1 1
2 12
3 8
You can group by "Week", select the "Time" column, and pass multiple functions (such as the list constructor and sum) to GroupBy.agg to do both things at once:
out = dfx.groupby('Week')['Time'].agg(Times=list, Total=sum)
Output:
Times Total
Week
1 [1] 1
2 [2, 4, 6] 12
3 [3, 5] 8
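If you want it printed in the "Week n: values sum" shape from the question, a small loop over the aggregated frame works (a sketch reusing the out frame above):

# iterate the aggregated frame (indexed by Week) and format each row
for week, row in out.iterrows():
    print(f"Week {week}: {' '.join(map(str, row['Times']))} {row['Total']}")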
I have a pandas DataFrame df with 102 columns, each named differently, say A, B, C, etc., so the original DataFrame has the following structure:
Column A  Column B  Column C  ...
Row 1
Row 2
...
Row n
I would like to change the column names from A, B, C, etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to rename all columns to F1 through F102 automatically, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
One way to do this is to convert the DataFrame to a pair of lists and rewrite the column-name list using a loop index:
import pandas as pd

d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)

cols = list(dataFrame.columns.values)  # original column names
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename based on the index
    index += 1

vals = dataFrame.values.tolist()  # the row values
newDataFrame = pd.DataFrame(vals, columns=cols)  # new frame with new names
print(newDataFrame)
Output:
F1 F2 F3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 4 4 4
6 3 3 3
7 2 2 2
8 1 1 1
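For completeness, the loop above can also be avoided with set_axis, which returns a renamed copy without rebuilding the frame from lists (a sketch reusing dataFrame from the example):

# set_axis(labels, axis=1) replaces the column labels in one call
renamed = dataFrame.set_axis(
    ["F" + str(i) for i in range(1, dataFrame.shape[1] + 1)], axis=1)
print(renamed)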