Read and split a column values from a dataframe - python

I have a dataset, where the second column looks like this.
FileName
892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm
2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm
1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm
Now, I want to extract the name before ".exe.asm" from the column and append it to a new list for all the rows of my dataset. I tried the following code:
import pandas as pd
df = pd.read_csv("dataset1.csv")
exekey = []
for row in df.iterrows():
exekey.append(row[1].split('.'))
exekey
This execution gave me the following error:
AttributeError: 'Series' object has no attribute 'split'
I am not able to do it. Please help
On changing, the output was of the form Output image

Split the filename using . and access 1st element using indexing.
import pandas as pd
df = pd.DataFrame({'FileName':['892e7c8382943342a29a6ae5a55f2272532d8e04.exe.asm',
'2d42c1b2c33a440d165683eeeec341ebf61218a1.exe.asm',
'1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed.exe.asm']})
exekey = [i.split(".")[0] for i in df['FileName']]
print(exekey)
Alternate way:
exekey2 = df['FileName'].apply(lambda x: x.split(".")[0]).tolist()
Output:
['892e7c8382943342a29a6ae5a55f2272532d8e04', '2d42c1b2c33a440d165683eeeec341ebf61218a1', '1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']

You can use map like this to split on . and take index 0,
df['FileName'].map(lambda f : f.split('.')[0])
# Output
0 892e7c8382943342a29a6ae5a55f2272532d8e04
1 2d42c1b2c33a440d165683eeeec341ebf61218a1
2 1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed
Name: FileName, dtype: object
If you want to get a list of names you can do,
df['FileName'].map(lambda f : f.split('.')[0]).values.tolist()
# Output : ['892e7c8382943342a29a6ae5a55f2272532d8e04',
'2d42c1b2c33a440d165683eeeec341ebf61218a1',
'1fbab6b4566a2465a8668bbfed21c0bfaa2c2eed']

Related

How do I capture the properties I want from a string?

I hope you are well I have the following string:
"{\"code\":0,\"description\":\"Done\",\"response\":{\"id\":\"8-717-2346\",\"idType\":\"CIP\",\"suscriptionId\":\"92118213\"},....\"childProducts\":[]}}"...
To which I'm trying to capture the attributes: id, idType and subscriptionId and map them as a dataframe, but the entire body of the .cvs puts it in a single row so it is almost impossible for me to work without index
desired output:
id, idType, suscriptionID
0. '7-84-1811', 'CIP', 21312421412
1. '1-232-42', 'IO' , 21421e324
My code:
import pandas as pd
import json
path = '/example.csv'
df = pd.read_csv(path)
normalize_df = json.load(df)
print(df)
Considering your string is in JSON format, you can do this.
drop columns, transpose, and get headers right.
toEscape = "{\"code\":0,\"description\":\"Done\",\"response\":{\"id\":\"8-717-2346\",\"idType\":\"CIP\",\"suscriptionId\":\"92118213\"}}"
json_string = toEscape.encode('utf-8').decode('unicode_escape')
df = pd.read_json(json_string)
df = df.drop(["code","description"], axis=1)
df = df.transpose().reset_index().drop("index", axis=1)
df.to_csv("user_details.csv")
the output looks like this:
id idType suscriptionId
0 8-717-2346 CIP 92118213
Thank you for the question.

unable to parse using pd.json_normalize, it throws null with index values

Sample of my data:
ID
target
1
{"abc":"xyz"}
2
{"abc":"adf"}
this data was a csv output that i imported as below in python
data=pd.read_csv('location',converters{'target':json.loads},header=None,doublequote=True,encoding='unicode_escape')
data=data.drop(labels=0,axis=0)
data=data.rename(columns={0:'ID',1:'target'})
when I try to parse this data using
df=pd.json_normalize(data['target'])
i get Empty dataframe
0
1
You need to change the cells from strings to actual dicts and then your code works.
Try this:
df['target'] = df['target'].apply(json.loads)
df = pd.json_normalize(df['target'])

Fetching 2 columns from Tuple using Python

I have a tuple which looks like this when I iterate through its rows:
for row in df.itertuples(index=False, name=None):
print(row)
o/p :
(100214, '120.6843686', '-41.9098438')
(101105, '121.7692179', '-42.2737880')
(101847, '122.6417215', '-43.8718865')
Output Desired:
('120.6843686', '-41.9098438')
('121.7692179', '-42.2737880')
('122.6417215', '-43.8718865')
I am new to Python, so any help would really be appreciated.
Thanks..
Use the following code:
for row in df.itertuples(index=False, name=None):
print(row[1:])
This slices the tuple and displays everything after column 0. This article explains it in further detail if you're interested.
If you are just trying to get values here's a simple way:
import pandas as pd
df = pd.DataFrame((
(100214, '120.6843686', '-41.9098438'),
(101105, '121.7692179', '-42.2737880'),
(101847, '122.6417215', '-43.8718865'))
)
df = df.iloc[:, 1:].values.tolist()
print(df)
[['120.6843686', '-41.9098438'],
['121.7692179', '-42.2737880'],
['122.6417215', '-43.8718865']]

Column in DataFrame isn't recognised. Keyword Error: 'Date'

I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame and one of the columns doesn't appear to be recognised..?
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0,len(data)):
new_data['Date'][i] = data['Date'][i]
new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column "Amount"
To add to #Grzegorz Skibinski's answer, the problem is after line 4, there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index has a name "Date", you can't use 'Date' as a key to get the index - you have to use data.index[i] instead of data['Date'][i].
It seems that you have an error in the formatting of your Date column.
To check that you don't have an error on the name of the columns you can print the columns names:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column as a trailing space. This is an easy mistake to do, especially when using excel.
If you try to call "Fruit" instead of "Fruit " you obtain the error you have:
KeyError: 'Fruit'

Get the name of a pandas DataFrame

How do I get the name of a DataFrame and print it as a string?
Example:
boston (var name assigned to a csv file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.) % boston
You can name the dataframe with the following, and then call the name wherever you like:
import pandas as pd
df = pd.DataFrame( data=np.ones([4,4]) )
df.name = 'Ones'
print df.name
>>>
Ones
Sometimes df.name doesn't work.
you might get an error message:
'DataFrame' object has no attribute 'name'
try the below function:
def get_df_name(df):
name =[x for x in globals() if globals()[x] is df][0]
return name
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
From here what I understand DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0 0.541
1 -1.175
2 0.129
3 0.043
4 -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
'df'
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
attributes are retained through some operations.
Here is a sample function:
'df.name = file` : Sixth line in the code below
def df_list():
filename_list = current_stage_files(PATH)
df_list = []
for file in filename_list:
df = pd.read_csv(PATH+file)
df.name = file
df_list.append(df)
return df_list
I am working on a module for feature analysis and I had the same need as yours, as I would like to generate a report with the name of the pandas.Dataframe being analyzed. To solve this, I used the same solution presented by #scohe001 and #LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect
def aux_retrieve_name(var):
callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
print('--------- Feature Analyzer ----------')
print('Dataframe name: "{}"'.format(aux_retrieve_name(df)))
print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB

Categories