Removing Unicode from Pandas Column Text

I am trying to determine whether the items in a list appear in a dataframe column. I am new to Pandas and have been struggling with this, so at the moment I am turning the dataframe column of interest into a list. However, when I call .tolist() on the column, every string in the resulting list is shown as unicode (with a u prefix). As I am comparing these with text from another list that is not unicode, I am running into issues.
I tried converting the other list to unicode, but then its items read like u'["item"]', which didn't help. I have also tried to remove the unicode from the dataframe but only get errors. I cannot iterate because pandas tells me that the dataframe is too long to iterate over. Below is my code:
SDC_wb = pd.ExcelFile('C:\ BLeh')
df = SDC_wb.parse(SDC_wb.sheet_names[1], header = 1)
def Follower_count(filename):
    filename = open(filename)
    reader = csv.reader(filename)
    handles = df['things'].tolist()
    print handles
    dict1 = {}
    for item in reader:
        if item in handles:
            user = api.get_user(item)
            dict1[item] = user.Follower_count
    newdf = pd.DataFrame(dict1)
    newdf.to_csv('test1.csv', encoding='utf-8')
Here is what the list from the dataframe looks like:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
Here is what x = [unicode(s) for s in some_list] looks like:
[u"['#HomeGoods']", u"['#pier1']", u"['#houzz']", u"['#InteriorDesign']", u"['#zulily']"]
Naturally these don't align to check the "in" requirement. Thus, I need a method of converting the .tolist() object from:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
to:
[#Mastercard, #Visa, #AmericanExpress, #CapitalOne]
so that the if item in handles check will see matching handles.
Thanks for your help.
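One thing worth checking before converting anything: the u prefix is only how Python 2 displays a unicode string, and for plain ASCII text u'#Visa' == '#Visa' is already True there. A more likely cause of the failed match is that csv.reader yields each row as a list of fields, so item is ['#Visa'] rather than '#Visa'. The sketch below assumes Python 3 and a one-column CSV, and uses made-up in-memory data in place of the real Excel sheet and CSV file; it compares the first field of each row instead of the whole row.

```python
import csv
import io

# Hypothetical stand-ins for the DataFrame column and the CSV file.
handles = ['#Mastercard', '#Visa', '#AmericanExpress', '#CapitalOne']
csv_text = "#Visa\n#HomeGoods\n#CapitalOne\n"

reader = csv.reader(io.StringIO(csv_text))
rows = list(reader)

# Each row is a list of fields, e.g. ['#Visa'], so `row in handles` is never True.
whole_row_matches = [row for row in rows if row in handles]

# Comparing the first field (a plain string) behaves as intended.
field_matches = [row[0] for row in rows if row and row[0] in handles]

print(rows[0])
print(whole_row_matches)
print(field_matches)
```

If this is indeed the cause, the same change in the original loop would be to test item[0] rather than item.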

Related

Getting nested cells when using dictionaries in pandas (lasio for .LAS files)

I am using lasio (https://lasio.readthedocs.io/en/latest/index.html) to extract data from a .LAS file. It's an oil and gas drilling file format with data in the header and in the body (called the curve). In short, lasio reads the data as a pandas DataFrame, which is why I am using a dictionary to assign the data.
This is an output of a lasio file in notepad:
In the end, I need a file that has the UWI (unique well #), the depth and its porosity reading.
The UWI is one value, but there are multiple values for the depth and porosity, so I need the UWI repeated. To complicate matters, not all of my files have the porosity data, so I have had to screen for them too.
My code was going OK until I exported it and saw that the cells in the csv are nested. The code reads the values into a dictionary, and I need the UWI duplicated for each depth value.
data = []
df_global = pd.DataFrame(data)
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        DEPT = df["DEPT"]
        DPHI2 = df[match]
        DPHI = DPHI2.iloc[:,0]
        UWI = las.well.UWI.value
        df_global = df_global.append({'UWI': UWI, 'DEPTH': DEPT, 'DPHI': DPHI}, ignore_index=True)
df_global.to_csv('las_output.csv', index=False)
This is my output, note the nested rows.
I have tried
df.loc[:,"UWI"] = np.array(las.well.UWI.value*len(df.DEPT))
but the UWI value is just repeated and not put into rows.

Problem
You are appending dictionaries to an already-existing DataFrame. Each dictionary contains a variety of types (an integer under the key UWI, and pandas Series under other keys). This is a very general operation, and pandas reacts by converting the Series contained within the dictionary to strings, which is what you are seeing in columns B and C in Excel.
This is also probably not the operation you want to do, which appears to be appending DataFrames (i.e. one per file) to an existing DataFrame (df_global). Pandas does not make this easy for existing DataFrames, for good reason.
Solution
This is much simpler if you create a Python list (data) containing DataFrames, then use pandas' concat function to create a single DataFrame as the last step. See below for an example. I have not tested the code, because you didn't include a minimal reproducible example, but hopefully it helps.
import lasio
import pandas as pd
data = []
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:  # all_files: the list of .LAS paths from the question
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        # Keep the first curve (the index curve) plus whichever porosity curves matched
        columns_to_keep = [las.curves[0].mnemonic] + list(match)
        # Assign the single UWI value to a new column called "UWI"
        df['UWI'] = las.well.UWI.value
        columns_to_keep.append('UWI')
        data.append(df[columns_to_keep])
df_final = pd.concat(data, join='outer')  # join='outer' keeps every column found from `alias`
df_final.to_csv('las_output.csv', index=False)
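Since the code above is untested, here is a small self-contained sketch of the same idea that runs without any .LAS files. The UWI strings and curve values are invented for illustration, and the two small DataFrames merely stand in for what las.df().reset_index() would return for two wells with differently named porosity curves.

```python
import pandas as pd

# Hypothetical per-file results: (UWI, curve DataFrame) pairs.
wells = [
    ("WELL-A", pd.DataFrame({"DEPT": [100.0, 100.5], "DPHI": [0.12, 0.15]})),
    ("WELL-B", pd.DataFrame({"DEPT": [200.0, 200.5, 201.0], "PORD": [0.08, 0.09, 0.10]})),
]

frames = []
for uwi, curves in wells:
    curves = curves.copy()
    curves["UWI"] = uwi  # a scalar is broadcast to every row of the frame
    frames.append(curves)

# One concat at the end; join="outer" keeps columns that only some wells have.
df_final = pd.concat(frames, join="outer", ignore_index=True)
print(df_final)
```

Each depth row carries its own UWI, and a well that lacks a given porosity column simply gets NaN there instead of a nested cell.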

How can I see a list of the variables in a CSV column?

I have a csv file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like some code that lists the distinct values in a specific column.
For example, I'd like it to return {Kish, Qeshm, Tabriz} for the 'City' column.

You first need to import the csv module into your Python file, then read each row in the file and save the value in a list, like this:
import csv
cities = []
with open("yourfile.csv", "r") as file:
    reader = csv.DictReader(file)  # uses the first line of the csv file as the header, so that line is not returned as a row
    for row in reader:
        city = row["City"]
        cities.append(city)
This will give you a list like cities = ['Kish', 'Qeshm', 'Tabriz', ...], duplicates included.
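Because the question asks for the distinct values only, a set can be built directly while reading instead of collecting every row first. A minimal sketch, using an in-memory sample (trimmed to three of the original columns) in place of the real file:

```python
import csv
import io

# In-memory sample standing in for "yourfile.csv" (columns trimmed for brevity).
sample = io.StringIO(
    "Contract Code,State,City\n"
    "765720,East Azar,Kish\n"
    "766134,East Azar,Qeshm\n"
    "766140,East Azar,Tabriz\n"
    "766147,East Azar,Kish\n"
)

reader = csv.DictReader(sample)
# The set keeps one copy of each city, so memory grows with the number of
# distinct values rather than with the number of rows.
unique_cities = {row["City"] for row in reader}
print(unique_cities)
```

For the real file you would pass an open file object instead of the StringIO, for example open("yourfile.csv", newline="", encoding="utf-8"); the encoding here is an assumption and depends on how the Farsi data was saved.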

It appears you want to remove duplicates as well, which you can do by simply casting the finished list to a set. Here's how to do it with pandas:
import pandas as pd
cities = pd.read_csv('yourfile.csv', usecols=['City'])['City']
# just cast to list if you want a plain list instead of a Series
cities_list = list(cities)
# use set to remove the duplicates
unique_cities = set(cities)
If you need to preserve ordering, you could use an ordered dict with just keys.
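The "ordered dict with just keys" idea can be sketched as follows, assuming Python 3.7 or later, where a plain dict preserves insertion order (collections.OrderedDict.fromkeys works the same way on older versions). The sample list is made up:

```python
# Hypothetical list of cities in file order, duplicates included.
cities = ["Kish", "Qeshm", "Tabriz", "Tabriz", "Kish", "Tabriz"]

# dict keys are unique and keep insertion order, so this drops duplicates
# while preserving the order in which each city was first seen.
ordered_unique = list(dict.fromkeys(cities))
print(ordered_unique)  # ['Kish', 'Qeshm', 'Tabriz']
```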
Also, in case you're short on memory trying to read 5M rows in one go, you can read them in chunks:
import pandas as pd
cities_chunks_list = [chunk['City'] for chunk in pd.read_csv('yourfile.csv', usecols=['City'], chunksize=1000)]
# let's flatten the list
cities_list = [city for cities_chunk in cities_chunks_list for city in cities_chunk]
Hope I helped.

Reading rows in CSV file and appending a list creates a list of lists for each value

I am copying list output data from a DataCamp course so I can recreate the exercise in Visual Studio Code or Jupyter Notebook. From DataCamp Python Interactive window, I type the name of the list, highlight the output and paste it into a new file in VSCode. I use find and replace to delete all the commas and spaces and now have 142 numeric values, and I Save As life_exp.csv. Looks like this:
43.828
76.423
72.301
42.731
75.32
81.235
79.829
75.635
64.062
79.441
When I read the file into VSCode using either Pandas read_csv or csv.reader and use values.tolist() with Pandas or a for loop to append an existing, blank list, both cases provide me with a list of lists which then does not display the data correctly when I try to create matplotlib histograms.
I used NotePad to save the data as well as a .csv and both ways of saving the data produce the same issue.
import matplotlib.pyplot as plt
import csv
life_exp = []
with open ('C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter = '\n')
    for row in exp_read:
        life_exp.append(row)
And
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
life_exp = life_exp_df.values.tolist()
When you print life_exp after importing using csv, you get:
[['43.828'],
['76.423'],
['72.301'],
['42.731'],
['75.32'],
['81.235'],
['79.829'],
['75.635'],
['64.062'],
['79.441'],
['56.728'],
….
And when you print life_exp after importing using pandas read_csv, you get the same thing, but at least now it's not a string:
[[43.828],
[76.423],
[72.301],
[42.731],
[75.32],
[81.235],
[79.829],
[75.635],
[64.062],
[79.441],
[56.728],
…
And when you call plt.hist(life_exp) on either version of the list, you get each value as a bin of 1.
I just want to read each value in the csv file and put each value into a simple Python list.
I have spent days scouring Stack Overflow thinking someone must have done this, but I can't seem to find an answer. I am very new to Python, so your help is greatly appreciated.

Try:
import pandas as pd
life_exp_df = pd.read_csv('c:\\data\\life_exp.csv', header = None)
# Select the values of your first column as a list
life_exp = life_exp_df.iloc[:, 0].tolist()
instead of:
life_exp = life_exp_df.values.tolist()

csv.reader parses each line into a list using the delimiter you provide. In this case you provide \n as the delimiter, but it still takes the single item on each line and returns it as a one-element list.
When you append each row, you are essentially appending that list to another list. The simplest workaround is to index into row to extract the value:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        life_exp.append(row[0])
However, if your data is not guaranteed to be formatted the way you have shown, you will need to handle that a bit differently:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    for row in exp_read:
        for number in row:
            life_exp.append(number)
A bit cleaner, using extend with a generator expression instead of a comprehension run only for its side effects:
with open(r'C:\data\life_exp.csv', 'rt') as life_expcsv:
    exp_read = csv.reader(life_expcsv, delimiter='\n')
    life_exp.extend(number for row in exp_read for number in row)
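Note that the csv-based snippets in this answer still leave every value as a string, whereas a histogram needs numbers. A minimal sketch that reads the file into a flat list of floats, assuming exactly one number per line and using an in-memory sample in place of life_exp.csv:

```python
import csv
import io

# In-memory sample standing in for life_exp.csv (one value per line).
sample = io.StringIO("43.828\n76.423\n72.301\n42.731\n75.32\n")

# The default delimiter is fine here because each line holds a single field.
reader = csv.reader(sample)

# Take the first field of each non-empty row and convert it to float.
life_exp = [float(row[0]) for row in reader if row]
print(life_exp)
```

The resulting list can be passed straight to plt.hist(life_exp).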

Converting a string representation of dicts to an actual dict

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here (Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)
pd.DataFrame(???)

You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
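for key, value in parsed.items():
    parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by adding each one-row frame to your dataframe. Note that DataFrame.append returned a new frame rather than modifying df in place, and it was removed in pandas 2.0, so pd.concat is used here and its result is assigned back to df:
df = pd.DataFrame()
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        for key, value in parsed.items():
            parsed[key] = [value]
        df = pd.concat([df, pd.DataFrame.from_dict(parsed)], ignore_index=True)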

parsed is a dictionary; make a dataframe from each one, then join all the frames together:
frames = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if not isinstance(parsed, dict):
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        frames.append(subDF)
df = pd.concat(frames, ignore_index=True, sort=False)
Calling pd.concat once on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
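A shorter variant is to collect the parsed dicts in a list and hand the whole list to pd.DataFrame, which builds one row per dict. The sketch below also assumes that each line in the file really is wrapped in double quotes as shown in the question; in that case ast.literal_eval on the raw line would return a string rather than a dict, so csv.reader is used first to strip the outer quotes. The in-memory sample stands in for file_name.csv.

```python
import ast
import csv
import io

import pandas as pd

# In-memory sample standing in for file_name.csv; each line is one quoted field.
sample = io.StringIO(
    "\"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}\"\n"
    "\"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}\"\n"
)

records = []
for row in csv.reader(sample):
    if not row:
        continue
    # csv.reader removes the surrounding double quotes, leaving the dict literal.
    parsed = ast.literal_eval(row[0])
    if isinstance(parsed, dict):
        records.append(parsed)

# A list of dicts becomes one row per dict; keys missing from a row become NaN.
df = pd.DataFrame(records)
print(df)
```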

Converting values of named tuples from strings to integers

I'm creating a script to read a csv file into a set of named tuples from their column headers. I will then use these namedtuples to pull out rows of data which meet certain criteria.
I've worked out the input (shown below), but am having issues with filtering the data before outputting it to another file.
import csv
from collections import namedtuple
with open('test_data.csv') as f:
    f_csv = csv.reader(f) #read using csv.reader()
    Base = namedtuple('Base', next(f_csv)) #create namedtuple keys from header row
    for r in f_csv: #for each row in the file
        row = Base(*r)
        # Process row
        print(row) #print data
The contents of my input file are as follows:
Locus Total_Depth Average_Depth_sample Depth_for_17
chr1:6484996 1030 1030 1030
chr1:6484997 14 14 14
chr1:6484998 0 0 0
And they are printed from my code as follow:
Base(Locus='chr1:6484996', Total_Depth='1030', Average_Depth_sample='1030', Depth_for_17='1030')
Base(Locus='chr1:6484997', Total_Depth='14', Average_Depth_sample='14', Depth_for_17='14')
Base(Locus='chr1:6484998', Total_Depth='0', Average_Depth_sample='0', Depth_for_17='0')
I want to be able to pull out only the records with a Total_Depth greater than 15.
Intuitively I tried the following function:
if Base.Total_Depth >= 15 :
    print row
However, this only prints the final row of data (from the above output table). I think the problem is twofold. First, as far as I can tell, I'm not storing my named tuples anywhere for them to be referenced later. Second, the numbers are being read in as strings rather than as integers.
Firstly, can someone tell me whether I need to store my namedtuples somewhere?
And secondly, how do I convert the string values to integers? Or is this not possible because namedtuples are immutable?
Thanks!
I previously asked a similar question with respect to dictionaries, but now would like to use namedtuples instead. :)

Map your values to int when creating the named tuple instances:
row = Base(r[0], *map(int, r[1:]))
This keeps the r[0] value as a string, and maps the remaining values to int().
This does require knowledge of the CSV columns, since the choice of which ones are converted to integers is hardcoded here.
Demo:
>>> from collections import namedtuple
>>> Base = namedtuple('Base', ['Locus', 'Total_Depth', 'Average_Depth_sample', 'Depth_for_17'])
>>> r = ['chr1:6484996', '1030', '1030', '1030']
>>> Base(r[0], *map(int, r[1:]))
Base(Locus='chr1:6484996', Total_Depth=1030, Average_Depth_sample=1030, Depth_for_17=1030)
Note that you should test against the rows, not the Base class:
if row.Total_Depth >= 15:
within the loop, or in a new loop of collected rows.
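Putting the pieces together, here is a minimal end-to-end sketch: it stores every row in a list so the rows can be filtered afterwards, converts the numeric columns while building each namedtuple, and tests the instances rather than the class. The in-memory sample stands in for test_data.csv and is assumed to be comma-separated; pass delimiter='\t' to csv.reader if the real file uses tabs.

```python
import csv
import io
from collections import namedtuple

# In-memory sample standing in for test_data.csv (assumed comma-separated).
sample = io.StringIO(
    "Locus,Total_Depth,Average_Depth_sample,Depth_for_17\n"
    "chr1:6484996,1030,1030,1030\n"
    "chr1:6484997,14,14,14\n"
    "chr1:6484998,0,0,0\n"
)

reader = csv.reader(sample)
Base = namedtuple("Base", next(reader))  # field names come from the header row

# Keep every row for later use; only the first column stays a string.
rows = [Base(r[0], *map(int, r[1:])) for r in reader]

# Filter on the instances, not on the Base class.
deep = [row for row in rows if row.Total_Depth > 15]
print(deep)
```

Immutability is not an obstacle here because the values are converted before each namedtuple is created, not changed afterwards.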
