Hey guys, I have a data file, train.dat, that looks like the sample below. I am trying to create a variable that will contain the ith value of the column containing -1 or +1, and another variable to hold the value of the column that has strings.
So far I have tried this:
df=pd.read_csv("train.dat",delimiter="\t", sep=',')
# print(df.head())
# separate names from classes
vals = df.ix[:,:].values
names = [n[0][3:] for n in vals]
cls = [n[0][0:] for n in vals]
print(cls)
However, the output looks all jumbled up; any help would be appreciated. I am a beginner in Python.
If the character after the numerical value is a tab, you're fine, and all you would need is:
import io # using io.StringIO for demonstration
import pandas as pd
ratings = ("-1\tThis movie really sucks.\n"
           "-1\tRun colored water through a reflux condenser and call it a science movie?\n"
           "+1\tJust another zombie flick? You'll be surprised!")
df = pd.read_csv(io.StringIO(ratings), sep='\t',
header=None, names=['change', 'rating'])
Passing header=None makes sure that the first line is interpreted as data.
Passing names=['change', 'rating'] provides some (reasonable) column headers.
Of course, the character is not a tab :D.
import io  # using io.StringIO for demonstration
import pandas as pd
ratings = ("-1 This movie really sucks.\n"
           "-1 Run colored water through a reflux condenser and call it a science movie?\n"
           "+1 Just another zombie flick? You'll be surprised!")
df = pd.read_csv(io.StringIO(ratings), sep='\t',
header=None, names=['stuff'])
df['change'], df['rating'] = df.stuff.str[:3], df.stuff.str[3:]
df = df.drop('stuff', axis=1)  # drop returns a copy; assign it back
One viable option is to read the whole rating in as one temporary column, split the string, distribute the pieces across two columns, and finally drop the temporary column.
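An equivalent way to do the split step (a sketch, assuming the `-1`/`+1` marker is always followed by a single space) is `str.split` with `n=1` and `expand=True`, which splits each line only once and returns two columns directly:

```python
import io
import pandas as pd

ratings = ("-1 This movie really sucks.\n"
           "-1 Run colored water through a reflux condenser and call it a science movie?\n"
           "+1 Just another zombie flick? You'll be surprised!")

# sep='\t' never matches, so each whole line lands in the single 'stuff' column
df = pd.read_csv(io.StringIO(ratings), sep='\t', header=None, names=['stuff'])

# split each line once on the first space: left part is the label, right part the text
df[['change', 'rating']] = df['stuff'].str.split(' ', n=1, expand=True)
df = df.drop('stuff', axis=1)
print(df)
```

This avoids the fixed-width `str[:3]` slicing, so it still works if the label ever has a different length.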
Related
I'm currently trying to use Python to read a text file into SQLite3 using Pandas. Here are a few entries from the text file:
1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
The data consists of popular baby names, and I have to separate male names and female names into their own tables and perform queries on them. My method consists of first placing all the data into both tables, then dropping the unneeded columns afterwards. My issue is that when I try to add names to the columns, I get a ValueError: the expected axis has 6 elements, but 7 values were passed. I'm assuming it's because Pandas possibly isn't reading the last value of each line, but I can't figure out how to fix it. My current delimiter is the whitespace delimiter that you can see below.
Here is my code:
import sqlite3
import pandas as pd
import csv

con = sqlite3.connect("C:\\****\\****\\****\\****\\****\\baby_names.db")
c = con.cursor()

# Please note that most of these functions will be commented out, because they will only be run once.
def create_and_insert():
    # load data
    df = pd.read_csv('babynames.txt', index_col=0, header=None, sep=r'\s+', engine='python')
    # Reading the textfile
    df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber', 'Girlpercent']
    # Adding Column names
    df.columns = df.columns.str.strip()
    con = sqlite3.connect("*************\\baby_names.db")
    # drop data into database
    df.to_sql("Combined", con)
    df.to_sql("Boys", con)
    df.to_sql("Girls", con)
    con.commit()
    con.close()

create_and_insert()

def test():
    c.execute("SELECT * FROM Boys WHERE Rank = 1")
    print(c.fetchall())

test()
con.commit()
con.close()
I've tried adding multiple delimiters, but it didn't seem to do anything. Using just a regular space as the delimiter seems to create 'blank' column names. From reading the Pandas docs, it seems multiple delimiters are possible, but I can't quite figure it out. Any help would be greatly appreciated!
Note that:
your input file contains 7 columns,
but the initial column is set as the index (you passed index_col=0),
so your DataFrame contains only 6 regular columns.
Print df to confirm it.
Now, when you run df.columns = ['Rank', ...], you attempt to assign the
7 passed names to the existing 6 data columns.
Probably you should:
read the DataFrame without setting the index (for now),
assign all 7 column names,
set Rank column as the index.
The code to do it is:
df = pd.read_csv('babynames.txt', header=None, sep=r'\s+', engine='python')
df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
'Girlpercent']
df.set_index('Rank', inplace=True)
Or even shorter (all in one):
df = pd.read_csv('babynames.txt', sep=r'\s+', engine='python',
                 names=['Rank', 'BoyName', 'Boynumber', 'Boypercent',
                        'Girlname', 'Girlnumber', 'Girlpercent'],
                 index_col='Rank')
I'm trying to write code that will search the columns of my data frame for a desired string. For instance, I want to put into a data frame all the companies with "General" in the name and, later (a separate problem), all those that begin with a "T". For the first issue, I have the following code:
import pandas as pd
import csv
Forbes = pd.read_csv('Forbes2000.csv')
pd.set_option('precision', 2)
Forbes.columns=['#','Rank','Name','Country','Category','Sales','Profits','Assets','Marketvalue',]
for item in lines:
    if 'General' in Forbes["Name"]:
        Forbes.head()
This doesn't really return much of anything. I get "NameError: name 'lines' is not defined." I've tried something like the following:
import pandas as pd
import csv
Forbes = pd.read_csv('Forbes2000.csv')
pd.set_option('precision', 2)
Forbes.columns=['#','Rank','Name','Country','Category','Sales','Profits','Assets','Marketvalue',]
Forbes[ Forbes["Name"] == "General"].head()
This returns nothing, which I'm led to believe happens because Python is searching for an item in "Name" that is entirely equal to "General" instead of just searching for its appearance. What can I write to have Python print all the companies with "General" in the name, such as "General Motors" or "General Electric", from my list? This is a somewhat separate problem, but from there, how would I print all companies that begin with the letter "T"?
import pandas as pd
df = pd.DataFrame(['General motor','abc','General Electric','xyz'], columns = ['name'])
df[df['name'].str.contains('General')]
# output
name
0 General motor
2 General Electric
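For the second part of the question (companies whose name begins with "T"), `str.startswith` gives a boolean mask the same way `str.contains` does; a sketch on a made-up frame:

```python
import pandas as pd

# toy stand-in for the Forbes data; the company names here are assumptions
df = pd.DataFrame(['General motor', 'Toyota', 'General Electric', 'Tesla', 'xyz'],
                  columns=['name'])

# boolean mask: True where the name starts with a capital T
print(df[df['name'].str.startswith('T')])
```

On the real data this would be `Forbes[Forbes['Name'].str.startswith('T')]`.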
I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are characters that appear to be thrown in (long lines of spaces, for example) that separate the headers from the rest of the data points in the file. (An example file is located at https://www.chicagofed.org/applications/bhc/bhc-home ; my example referenced Q3 1999.)
Issues:
1) Too many headers to manually create them and I need to do this for many files that may have new headers as we move forward or backwards throughout the time series
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) the unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below simply tries to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd

def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried it with the file you linked and one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = pd.concat([df, df_more_actual])  # DataFrame.append was removed in pandas 2.0
print(new_df.shape)
The final dataframe has the rows of both CSVs and the union of their columns.
You can do this for the CSV of each quarter, concatenating as you go, so finally you will have all of their rows and the union of their columns.
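A sketch of that loop, using small in-memory stand-ins for the quarterly files (the file names, field names, and contents below are assumptions for demonstration):

```python
import io
import pandas as pd

# fake quarterly files standing in for bhcf9909.txt, bhcf1506.txt, etc.
quarters = {
    'bhcf9909': io.StringIO("rssd9001^rssd9999\n1^19990930\n"),
    'bhcf1506': io.StringIO("RSSD9001^RSSD9999^NEWCOL\n2^20150630^x\n"),
}

frames = []
for name, handle in quarters.items():
    df = pd.read_csv(handle, sep='^')
    df.columns = df.columns.str.upper()  # normalize so matching columns line up
    frames.append(df)

# rows from every quarter, union of the columns; unreported cells become NaN
all_quarters = pd.concat(frames, ignore_index=True, sort=False)
print(all_quarters.shape)  # (2, 3)
```

This also answers the NaN question: when a quarter lacks a column that another quarter has, `concat` fills those cells with NaN automatically.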
I have an Excel file with many columns containing strings, but I only want to import the columns of this Excel file that contain 'NGUYEN'.
I want to generate a single string from the columns of my Excel file which have 'NGUYEN' in them.
import pandas as pd
data = pd.read_excel("my_excel.xlsx", parse_cols='NGUYEN' in col for cols in my_excel.xlsx, skiprows=[0])
data = data.to_string()
print(data)
SyntaxError: invalid syntax
my_excel.xlsx
Function output should be
data = 'NGUYEN VIETNAM HANOIR HAIR PANTS BIKES CYCLING ORANGE GIRL TABLE DARLYN NGUYEN OMG LOL'
I'm pretty sure this is what you are looking for. I tried to make it as simple and compact as possible; if you need help making a more readable multi-line function, let me know!
import pandas as pd
data = pd.read_excel("my_excel.xlsx")
getColumnsByContent = lambda string: ' '.join([' '.join([elem for elem in data[column]]) for column in data.columns if string in data[column].to_numpy()])
print(getColumnsByContent('NGUYEN'))
print(getColumnsByContent('PANTS'))
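For reference, the same logic as a multi-line function, which is easier to step through (the tiny frame below is a made-up stand-in for my_excel.xlsx):

```python
import pandas as pd

def columns_by_content(data, string):
    """Join the values of every column that contains `string` into one space-separated string."""
    parts = []
    for column in data.columns:
        # membership test against the column's values
        if string in data[column].to_numpy():
            parts.append(' '.join(str(elem) for elem in data[column]))
    return ' '.join(parts)

# tiny stand-in for pd.read_excel("my_excel.xlsx")
data = pd.DataFrame({'a': ['NGUYEN', 'VIETNAM'], 'b': ['PANTS', 'BIKES']})
print(columns_by_content(data, 'NGUYEN'))  # NGUYEN VIETNAM
```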
I'm reading a CSV file into pandas. The issue is that some rows need to be removed and values calculated from the other rows. My current idea starts like this:
import csv

with open(down_path.name) as csv_file:
    rdr = csv.DictReader(csv_file)
    index_count = 0
    for row in rdr:
        row_type = row['']  # first column has an empty header
        if row_type == 'Ward Summary':
            current_ward = row['Name']
        else:
            name = row['Name']
            count1 = row['count1']
            count2 = row['count2']
            count3 = row['count3']
            index_count += 1
            # write to someplace
,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0
The end result needs to end up in a dataframe that I can concatenate to an existing dataframe.
The braindead way to do this is simply to do my conversions, create a new CSV file, and then read that in. That seems like a non-Pythonic way to go.
I need to take out the summary lines, combine those with similar names (Aloha 1 and Aloha I), remove the individual-stat labels, and put the Aloha 1 label on each of the individuals. Plus I need to add which month this data is from. As you can see, the data needs some work :)
desired output would be
Jan-16, Aloha 1, John, 1,2,3
Where the Aloha 1 comes from the summary line above it
My personal preference would be to do everything in Pandas.
Perhaps something like this...
# imports
import numpy as np
import pandas as pd
from io import StringIO
# read in your data
data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""
df = pd.read_csv(StringIO(data))
# give the first column a better name for convenience
df.rename(columns={'Unnamed: 0':'Desc'}, inplace=True)
# create a mask for the Ward Summary lines
ws_mask = df.Desc == 'Ward Summary'
# create a ward_name column that has names only for Ward Summary lines
df['ward_name'] = np.where(ws_mask, df.Name, np.nan)
# forward fill the missing ward names from the previous summary line
df['ward_name'] = df['ward_name'].ffill()
# get rid of the ward summary lines
df = df.loc[~ws_mask]
# get rid of the Desc column
df = df.drop('Desc', axis=1)
Yes, you pass over the data more than once, so you could potentially do better with a smarter single-pass algorithm. But if performance isn't your main concern, I think this has benefits in terms of conciseness and readability.