How to access data in a re.finditer output object? - python

I'd like to access the 'span' and 'match' data from the object I've generated with regex.finditer, but I can't find how to transfer the object structure into a pandas df so I can manipulate it more easily.
I can iterate through the object to print the data, but the regex.finditer documentation does not say how to access the data. The best I can find is the page https://docs.python.org/2.0/lib/match-objects.html
I tried just appending the rows to a pandas df, but no luck. See the code below. It gives this error:
TypeError: cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import re
import pandas as pd
def find_rez(string):
    regex = re.compile(r'\s\d{10}\s')
    return regex.finditer(string)
#open file with text data
file = open('prepaid_transactions_test2.txt')
text = file.read()
#get regex object with locations of all matches.
rez_mo = find_rez(text)
#Create empty df with span and match columns.
df = pd.DataFrame(columns=['span','match'])
#Append each row from object to pandas df. NOT WORKING.
for i in rez_mo:
    df.append(i)
I'd like to have a pandas df with the range & match as columns. But I'm failing at converting the types it seems.

I just found a solution. It may not be the most elegant, but it works.
for i in rez_mo:
    df.loc[len(df)] = [i.start()], [i.group()]
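As an alternative to growing the DataFrame one row at a time, the match objects can be collected into a list of rows first. The following is a minimal sketch, not the asker's code: the helper name matches_to_frame and the sample string are assumptions made for illustration. It relies on the match object methods span() and group(), which return the (start, end) tuple and the matched text.

```python
import re
import pandas as pd

def matches_to_frame(text, pattern=r'\s\d{10}\s'):
    """Collect every match of `pattern` into a DataFrame (hypothetical helper)."""
    regex = re.compile(pattern)
    # each match object exposes span() -> (start, end) and group() -> matched text
    rows = [{'span': m.span(), 'match': m.group()} for m in regex.finditer(text)]
    return pd.DataFrame(rows, columns=['span', 'match'])

# made-up sample text containing two 10-digit numbers
sample = "id 1234567890 and 0987654321 end"
df = matches_to_frame(sample)
print(df)
```

Building the list first and constructing the DataFrame once avoids repeated row insertion, which tends to be slow on large inputs.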

Related

Python - How to perform log2 normalisation only on selected range of rows and cols only

I want to perform a log2 transformation only on the int data in a dataframe, without losing the (string) labels in the first column and row.
In Excel it looks like this after conversion
Please suggest how to code this in Python; I am getting an error using a simple np.log2() transformation due to the presence of strings.
Code:
import pandas as pd
import numpy as np
data1 = pd.read_excel("#here i mention Path Of ExcelFile")
data2 = np.log2(data1)
error:
AttributeError: 'str' object has no attribute 'log2'
I think the first column in the Excel file should be converted to the index, so that only numeric columns remain:
data1 = pd.read_excel("#here i mention Path Of ExcelFile", index_col=0)
data2 = np.log2(data1)
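A small in-memory sketch of the same idea, assuming made-up column names (gene, s1, s2) in place of the Excel file. It shows two variants: moving the label column into the index, and transforming only the numeric columns with select_dtypes while leaving the labels as a regular column.

```python
import numpy as np
import pandas as pd

# hypothetical data standing in for the Excel sheet
data1 = pd.DataFrame({'gene': ['a', 'b'], 's1': [1, 8], 's2': [2, 4]})

# Variant 1: labels become the index, so every remaining column is numeric
log_indexed = np.log2(data1.set_index('gene'))

# Variant 2: keep the label column and transform only the numeric columns
numeric_cols = data1.select_dtypes(include='number').columns
data2 = data1.copy()
data2[numeric_cols] = np.log2(data1[numeric_cols])
print(data2)
```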

How do I extract the date from a column in a csv file using pandas?

This is the 'aired' column in the csv file:
Link to the csv file:
https://drive.google.com/file/d/1w7kIJ5O6XIStiimowC5TLsOCUEJxuy6x/view?usp=sharing
I want to extract the date and the month (in words) from the date following the word 'from' and store them in a separate column in another csv file. The 'from' is an obstruction: had it been just the date, it could easily have been extracted as a timestamp.
You are starting from a string and want to break out the data within it. The single quotes are a clue that this is a dict structure in string form. The Python standard library includes the ast (Abstract Syntax Trees) module, whose literal_eval function can read a string into a dict, gleaned from this SO answer: Convert a String representation of a Dictionary to a dictionary?
You want to apply that to your column to get the dict, at which point you expand it into separate columns using .apply(pd.Series), based on this SO answer: Splitting dictionary/list inside a Pandas Column into Separate Columns
Try the following
import pandas as pd
import ast
df = pd.read_csv('AnimeList.csv')
# turn the pd.Series of strings into a pd.Series of dicts
aired_dict = df['aired'].apply(ast.literal_eval)
# turn the pd.Series of dicts into a pd.Series of pd.Series objects
aired_df = aired_dict.apply(pd.Series)
# pandas automatically translates that into a pd.DataFrame
# concatenate the remainder of the dataframe with the new data
df_aired = pd.concat([df.drop(['aired'], axis=1), aired_df], axis=1)
# convert the date strings to datetime values
df_aired['aired_from'] = pd.to_datetime(df_aired['from'])
df_aired['aired_to'] = pd.to_datetime(df_aired['to'])
import pandas as pd
file = pd.read_csv('file.csv')
result = []
for cell in file['aired']:
    # assumes each cell looks like "{'from': '2012-04-06', ...}", so the date
    # sits at characters 10-19; adjust the slice if the layout differs
    date = cell[10:20]
    date_ts = pd.to_datetime(date, format='%Y-%m-%d')
    result.append((date_ts.month_name(), date_ts))
df = pd.DataFrame(result, columns=['month', 'date'])
df.to_csv('result_file.csv')
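The two approaches above can be combined: parse the dict string with ast.literal_eval, then use the .dt accessor to get the month name. This is a sketch under assumptions: the two sample rows are invented, and the real file is assumed to store 'from' and 'to' keys as in the first answer.

```python
import ast
import pandas as pd

# invented rows imitating the 'aired' column (dicts stored as strings)
df = pd.DataFrame({'aired': [
    "{'from': '2012-04-06', 'to': '2012-06-22'}",
    "{'from': '1998-10-01', 'to': None}",
]})

# string -> dict, then pull out the value stored under 'from'
aired = df['aired'].apply(ast.literal_eval)
out = pd.DataFrame({'date': pd.to_datetime(aired.apply(lambda d: d.get('from')))})

# month in words, taken from the parsed timestamp
out['month'] = out['date'].dt.month_name()
print(out)
```

Writing the result to another file is then a matter of calling out.to_csv with the desired path.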

Excel column converter with a specific column does not work

I tried to write a program that lets the user enter a column, sort it, and replace cells with other entered information, but I am probably making syntax errors.
I searched but could not find a solution.
import pandas as pd
data = pd.read_csv('List')
df = pd.DataFrame(data, columns = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'])
findL = ['example']
replaceL = ['convert']
col = 'C';
df[col] = df[col].replace(findL, replaceL)
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
It seems that df[col] and the values in findL and replaceL do not have the same datatype. Try running df[col] = df[col].astype(str) before df[col] = df[col].replace(findL, replaceL) and it should work.
If the columns you are dealing with have blank entries, set the na_filter parameter of .read_csv() to False.
That way, blank entries are read as empty strings rather than NaN (a float), so a text column holds only str values.
Calling .replace() then does not raise the TypeError, because strings are compared with strings rather than 'ndarray(dtype=float64)' with 'str'.
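Both suggestions can be tried on a small in-memory file. This is a sketch with invented CSV content; only the column name C and the find/replace lists come from the question.

```python
import io
import pandas as pd

# invented CSV text; the second row has a blank entry in column C
csv_text = "A,B,C\n1,x,example\n2,y,\n3,z,other\n"

# na_filter=False keeps blank cells as empty strings instead of NaN
df = pd.read_csv(io.StringIO(csv_text), na_filter=False)

findL = ['example']
replaceL = ['convert']
col = 'C'
# astype(str) guarantees string-to-string comparison inside replace()
df[col] = df[col].astype(str).replace(findL, replaceL)
print(df)
```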

JSON string within CSV data read by pandas [duplicate]

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
name dob eye_color height weight
0 john smith 1/1/1980 brown 160 76
1 dave jones 2/2/1981 blue 170 85
2 bob roberts 3/3/1982 green 180 94
I think applying json.loads is a good idea, but from there you can convert the result directly to dataframe columns instead of writing and loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
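For reference, here is a self-contained sketch of this approach run on the first two rows of the question's sample data, read from a string with io.StringIO instead of a file. The variable names stats and result are assumptions for illustration.

```python
import io
import json
import pandas as pd

# first two rows of the sample CSV from the question
csv_text = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
'''
df = pd.read_csv(io.StringIO(csv_text))

# parse each JSON string, then build one column per key
stats = pd.DataFrame(df['stats'].apply(json.loads).tolist(), index=df.index)

# replace the raw JSON column with the expanded columns
result = df.drop(columns='stats').join(stats)
print(result)
```

Because the new columns are labeled by their JSON keys, this does not depend on the keys appearing in any particular order.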
There is a slightly easier way, but ultimately you'll have to call json.loads. pandas.read_csv has the notion of a converter:
converters : dict. optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
We are telling read_csv to read the data in the standard way, but for the stats column use our custom parsers. This will make the stats column a dict
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the json object needs to have 3 values or at least missing values need to be handled in our CustomParser)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
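Option 2
If you didn't, the column may hold a Python literal (for example a dict written with single quotes) rather than valid JSON. The original suggestion was to pass eval as the converter; ast.literal_eval parses the same literals without executing arbitrary code:
import ast
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': ast.literal_eval})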
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
The json_normalize function in pandas does this without a custom function. It is available as pd.json_normalize since pandas 1.0; older versions import it from pandas.io.json.
(assuming you are loading the data from a file)
import json
import pandas as pd
df = pd.read_csv(file_path)
# parse each JSON string, then flatten the list of dicts into columns
stats_df = pd.json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.index = df.index
# join the new columns and drop the original JSON column
df = df.join(stats_df).drop(columns='stats')
If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the date time values.
This link has some tips on how to read a csv file with json strings into the dataframe.
You could do the following to read csv file with json string column and convert your json string into columns.
Read your csv into the dataframe (read_df)
read_df = pd.read_csv('yourFile.csv', converters={'state':json.loads}, header=0, quotechar="'")
Convert the json string column to a new dataframe
state_df = read_df['state'].apply(pd.Series)
Merge the 2 dataframe with index number.
df = pd.merge(read_df, state_df, left_index=True, right_index=True)

Get HTML table into pandas Dataframe, not list of dataframe objects

I apologize if this question has been answered elsewhere but I have been unsuccessful in finding a satisfactory answer here or elsewhere.
I am somewhat new to Python and pandas and am having some difficulty getting HTML data into a pandas dataframe. The pandas documentation says .read_html() returns a list of dataframe objects, so when I try to do some data manipulation to get rid of some samples I get an error.
Here is my code to read the HTML:
df = pd.read_html('http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)
Then I try to clean it up:
df = df.dropna(axis=0, thresh=4)
And I received the following error:
Traceback (most recent call last):
  File "module4.py", line 25, in <module>
    df = df.dropna(axis=0, thresh=4)
AttributeError: 'list' object has no attribute 'dropna'
How do I get this data into an actual dataframe, similar to what .read_csv() does?
From https://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-read-html: "read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content".
So df = df[0].dropna(axis=0, thresh=4) should do what you want.
pd.read_html returns you a list with one element and that element is the pandas dataframe, i.e.
df = pd.read_html(url) ###<-- List
df[0] ###<-- Pandas DataFrame
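The list behaviour is easy to check offline. The sketch below feeds read_html a small invented HTML table through io.StringIO instead of the ESPN URL; the column names and values are made up, and read_html needs an HTML parser such as lxml (or bs4 with html5lib) installed.

```python
import io
import pandas as pd

# invented table; the second data row is mostly empty
html = """
<table>
  <tr><th>player</th><th>gp</th><th>pts</th></tr>
  <tr><td>A</td><td>82</td><td>87</td></tr>
  <tr><td>B</td><td></td><td></td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # always a list, even for one table
df = tables[0]                            # the DataFrame itself

# keep only rows with at least two non-missing values
df = df.dropna(axis=0, thresh=2)
print(df)
```

If a page contains several tables, inspecting len(tables) and each element's columns is a simple way to find the right index.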
