I have CSV file with the following test content:
Name;N_D;mu;set
A;10;20;0
B;20;30;0
C;30;40;0
x;5;15;1
y;15;25;1
z;25;35;1
I'm reading the file with pandas, grouping the data, and then iterating through the groups. Within each group, I want to iterate through the rows of the data set:
import pandas as pd
df = pd.read_csv("samples_test.csv", delimiter=";", header=0)
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(txt, group["Name"][i])
The code fails on the line print(txt, group["Name"][i]) at the first element of the second group with a KeyError. I don't understand why...
Your code fails because group["Name"][i] is a label-based lookup, and the index labels of each group's Series do not match the 0-based counter from enumerate: the second group keeps its original labels 3, 4 and 5, so looking up label 0 raises a KeyError. (Note: also prefer .loc[] or .iloc[] and avoid chained indexing like group["Name"][i].)
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(i, group["Name"])
0 0    A
1    B
2    C
Name: Name, dtype: object
1 0    A
1    B
2    C
Name: Name, dtype: object
...
Your code should be changed as below, using .iloc[] together with get_loc to get the column's positional index:
groups = df.groupby("set")
for name, group in groups:
    somestuff = [group["N_D"], group["mu"], name]
    for i, txt in enumerate(group["Name"]):
        print(txt, group.iloc[i, group.columns.get_loc('Name')])
A A
B B
C C
x x
y y
z z
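An alternative sketch that sidesteps the positional bookkeeping entirely: reset the index inside each group so label and position agree (same data and groupby as above).
groups = df.groupby("set")
for name, group in groups:
    names = group["Name"].reset_index(drop=True)  # index is now 0..n-1 in every group
    for i, txt in enumerate(names):
        print(txt, names[i])  # label lookup and enumerate counter now match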
I am unable to compare two columns inside a grouped pandas dataframe. I used the groupby method to group the rows by two columns, and I need the list of fields whose predicted value does not match the actual value.
file_name | page_no | field_name | value | predicted_value | actual_value
-------------------------------------------------------------------------
A 1 a 1 zx zx
A 2 b 0 xt xi
B 1 a 1 qw qw
B 2 b 0 xr xe
desired output:
b
Because b is the only field where predicted_value and actual_value differ.
The following is my code:
groups = df1.groupby(['file_name', 'page_no'])
a = pd.DataFrame(columns=['file_name', 'page_no', 'value'])
for name, group in groups:
    lst = []
    if (group[group['predicted_value']] != group[group['actual_value']]):
        lst = lst.append(group[group['field_name']])
    print(lst)
Here, I'm trying to store them in a list, but I am getting a key error.
The error is as follows:
KeyError: "None of [Index(['A', '1234'])] are in the [columns]"
Here is a solution that compares the columns outside the groups; the groupby is not needed for this:
# rows where the two columns disagree
df1 = df[df['predicted_value'] != df['actual_value']]
# just the mismatching field names, as a list
s = df.loc[df['predicted_value'] != df['actual_value'], 'field_name']
L = s.tolist()
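With the sample data above, L comes out as ['b', 'b']; if each field should appear only once, as in the desired output, one extra step drops the duplicates (a small sketch):
L = s.drop_duplicates().tolist()  # ['b']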
Does this solve your problem?
# Create a new dataframe retrieving only non-matching values
df1 = df[df['predicted_value'] != df['actual_value']]
# Store the 'field_name' column in list format
lst = list(df1['field_name'])
print(lst)
I have a dataframe which looks like this (it contains dummy data) -
I want to remove the text that occurs after the "_________" identifier in each of the cells. I have written the following code (logic: add a new column containing NaN and save the edited values in that column) -
import pandas as pd
import numpy as np
df = pd.read_excel(r'Desktop\Trial.xlsx')
NaN = np.nan
df["Body2"] = NaN
substring = "____________"
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        # row["Body2"] = split_string[0]  # instead use the line below
        df.at[index, 'Body2'] = split_string[0]
Make use of .at to modify the value: iterrows() yields a copy of each row, so assigning to row["Body2"] never writes back to the original DataFrame.
Instead of iterating through the rows, do the operation on all rows at once. You can use expand=True to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None
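If only the text before the delimiter is wanted, which is what split_string[0] keeps in the question's loop, a single vectorized assignment is enough (a sketch reusing the substring variable from above):
# keep only the part before the delimiter; rows without it pass through unchanged
df["Body2"] = df["Body"].str.split(substring, n=1).str[0]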
I am importing a .txt file via read_table and get a DataFrame similar to
d = ['89278 5857', '1.000e-02', '1.591184e-02', '2.100053e-02', '89300 5857', '4.038443e-01', '4.037924e-01', '4.037336e-01']
df = pd.DataFrame(data = d)
and would like to reorganize it into
r = {'89278 5857': [1.000e-02, 1.591184e-02, 2.100053e-02], '89300 5857': [4.038443e-01, 4.037924e-01, 4.037336e-01]}
rf = pd.DataFrame(data = r)
The .txt file is typically 50k+ rows with an unknown number of '89278 5857' type values.
Thanks!
You can use itertools.groupby:
from itertools import groupby

data, cur_group = {}, None
for v, g in groupby(df[0], lambda k: " " in k):  # rows containing a space are group headers
    if v:
        cur_group = []              # a header row starts a new column
        data[next(g)] = cur_group
    else:
        cur_group.extend(g)         # value rows belong to the current column

df = pd.DataFrame(data)
print(df)
Prints:
89278 5857 89300 5857
0 1.000e-02 4.038443e-01
1 1.591184e-02 4.037924e-01
2 2.100053e-02 4.037336e-01
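One caveat: the columns still hold the original strings. If numbers are needed, a cast should follow (assuming every remaining value parses as a float):
df = df.astype(float)  # '1.000e-02' -> 0.01, and so on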
Assuming what delineates the start of the next group is a space, here is what I would do:
import numpy

df = df.rename(columns={0: 'value'})  # give the unnamed column a name to work with

df.assign(
    key=lambda df: numpy.where(
        df['value'].str.contains(' '),  # what defines each group
        df['value'],
        numpy.nan
    ),
).fillna(
    method='ffill'  # copy the group label down until the next group starts
).loc[
    lambda df: df['value'] != df['key']  # remove the rows that kicked off each group
].assign(
    idx=lambda df: df.groupby('key').cumcount()  # get a row number for each group
).pivot(
    index='idx',  # pivot into the wide format
    columns='key',
    values='value'
).astype(float)  # turn values into numbers instead of strings
And I get:
key 89278 5857 89300 5857
idx
0 0.010000 0.403844
1 0.015912 0.403792
2 0.021001 0.403734
I have a csv file with items like these: some,foo,bar, and I have different lists in Python with items like att1,some,bar,try,other. Is it possible, for every list, to create a row in the same csv file with 1 wherever the 'key' is present and 0 otherwise? So in this case the csv file would become:
some,foo,bar
1,0,1
Here's one approach, using Pandas.
Let's say the contents of example.csv are:
some,foo,bar
Then we can represent sample data with:
import pandas as pd
keys = ["att1","some","bar","try","other"]
data = pd.read_csv('~/Desktop/example.csv', header=None)
print(data)
0 1 2
0 some foo bar
matches = data.apply(lambda x: x.isin(keys).astype(int))
print(matches)
0 1 2
0 1 0 1
newdata = pd.concat([data, matches])
print(newdata)
0 1 2
0 some foo bar
0 1 0 1
Now write back to CSV:
newdata.to_csv('example.csv', index=False, header=False)
# example.csv contents
some,foo,bar
1,0,1
Given data and keys, we can condense it all into one chained command:
(pd.concat([data,
            data.apply(lambda x: x.isin(keys).astype(int))])
 .to_csv('example1.csv', index=False, header=False))
You could create a dictionary with the keys as the column names.
csv = {
    "some": [],
    "foo": [],
    "bar": []
}
Now, for each list check if the values are in the dict's keys, and append as necessary.
for lst in lists:  # 'lists' holds your key-lists; 'lst' avoids shadowing the built-in list
    for key in csv.keys():
        if key in lst:
            csv[key].append(1)
        else:
            csv[key].append(0)
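To get from the dict back to the CSV layout shown in the question, one way, sketched here on the assumption that pandas is acceptable, is:
import pandas as pd

# each key becomes a column header, each appended 0/1 a row
pd.DataFrame(csv).to_csv('example.csv', index=False)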
After running some commands I have a pandas dataframe, e.g.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, e.g.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
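Those three pieces are exactly what the DataFrame(...) call in the question needs; printing them as plain lists gives text that can be pasted straight back (just an illustration):
print(df.values.tolist())  # [[2, 1], [3, 2], [4, 3], [5, 4]]
print(list(df.columns))    # ['B', 'A']
print(list(df.index))      # [1, 2, 3, 4]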
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

DataFrame.command = MethodType(_gencmd, None, DataFrame)  # Python 2 unbound-method binding
I have only tested it on a few cases so far and would love a more general solution.
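A usage sketch (on Python 3 the three-argument MethodType form is gone, so the function is attached by plain assignment):
DataFrame.command = _gencmd  # Python 3 attachment of the same helper

df = DataFrame([[2, 1], [3, 2]], columns=['B', 'A'], index=[1, 2])
print(df.command())  # prints a DataFrame(...) call that rebuilds df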