How to read a table with chunksize and names

How can I read data from a CSV with chunksize and names?
I tried this:
sms = pd.read_table('demodata.csv', header=None, names=['label', 'good'])
X = sms.label.tolist()
y = sms.good.tolist()
and it worked totally fine. But if I try this, I get an error:
sms = pd.read_table('demodata.csv', chunksize=100, header=None, names=['label', 'good'])
X = sms.label.tolist()
y = sms.good.tolist()
And I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-e3f35149ab7f> in <module>()
----> 1 X = sms.label.tolist()
2 y = sms.good.tolist()
AttributeError: 'TextFileReader' object has no attribute 'label'
Why does it work in the first case but not in the second?
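For reference, a minimal sketch of how the chunked version could be consumed (an assumption, not part of the original question): with chunksize set, read_table returns a TextFileReader, an iterator of DataFrames, rather than a single DataFrame, so the columns have to be collected chunk by chunk.
import pandas as pd

X, y = [], []
# With chunksize, read_table yields DataFrames of up to 100 rows each,
# so we accumulate the columns while iterating over the reader.
for chunk in pd.read_table('demodata.csv', chunksize=100, header=None,
                           names=['label', 'good']):
    X.extend(chunk.label.tolist())
    y.extend(chunk.good.tolist())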

Related

Having trouble creating a DataFrame

I'm trying to convert views_dict[2017] to a DataFrame in a Jupyter notebook. views_dict is a dictionary keyed by year, and views_dict[2017] looks like this:
views_dict[2017]
[2102206,
1331781,
925375,
382331,
321960,
278439,
231613,
206570,
179082,
173855,
137089,
123836,
122077,
120140,
114837,
108279,
103176,
93963,
79388,
72907]
df = pd.DataFrame(list(zip(views_dict)), columns = ['views'])
df
NameError Traceback (most recent call last)
Input In [27], in <cell line: 1>()
----> 1 df = pd.DataFrame(list(zip(views_dict)), columns = ['views'])
2 df
NameError: name 'pd' is not defined
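A minimal sketch of one way to fix this, assuming views_dict[2017] is the list shown above: the NameError simply means pandas was never imported in this session, and zip(views_dict) would in any case iterate over the dictionary's keys rather than the values for 2017.
import pandas as pd  # the NameError means this import was missing

# Build the DataFrame straight from the list of view counts for 2017.
df = pd.DataFrame(views_dict[2017], columns=['views'])
df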

TypeError: expected string or bytes-like object in Pandas

I want to tokenize text, but I can't get it to work. How can I solve this?
Here is my problem:
#read_text from file
data = pd.read_csv("input data.txt",encoding = "UTF-8")
print(data)
Output: Bangla text
t = Tokenizers()
print(t.bn_word_tokenizer(data))
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-f9f299ecf33d> in <module>
      1 t = Tokenizers()
----> 2 print(t.bn_word_tokenizer(dataStr))

D:\anaconda\lib\site-packages\bnltk\tokenize\bn_word_tokenizers.py in bn_word_tokenizer(self, input_)
     15 tokenize_list = []
     16 r = re.compile(r'[\s।{}]+'.format(re.escape(punctuation)))
---> 17 list_ = r.split(input_)
     18 list_ = [i for i in list_ if i]
     19 return list_
TypeError: expected string or bytes-like object
Try this:
for column in data:
    a = data[column].apply(lambda text: t.bn_word_tokenizer(text))
    print(a)
This will print one column at a time. If you want to convert the entire DataFrame rather than just print it, replace a with data[column] in the code above.
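Putting that suggestion together, a rough sketch of writing the tokens back into the DataFrame (assuming every column holds plain string text and that Tokenizers comes from bnltk.tokenize, matching the path in the traceback):
import pandas as pd
from bnltk.tokenize import Tokenizers

data = pd.read_csv("input data.txt", encoding="UTF-8")
t = Tokenizers()
for column in data:
    # Overwrite each column with its tokenized form, cell by cell.
    data[column] = data[column].apply(lambda text: t.bn_word_tokenizer(text))
print(data)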

How can I debug this attribute error in Python?

I'm trying to follow a regression tutorial for Python, as the statsmodels package does not seem to be working for me. I got this far before receiving an attribute error.
input:
import pandas as pd
data = pd.read_csv("China_FDIGDP.csv")
data1 = data.dropna()
data1.to_csv("data1.csv", index = False)
Data = pd.read_csv("data1.csv")
print(Data)
x = pd.Data["GDP"].values()
y = pd.Data["FDI_net_in"].values()
Here's the output:
Traceback (most recent call last):
File "FDI.py", line 20, in <module>
x = pd.Data["GDP"].values()
AttributeError: module 'pandas' has no attribute 'Data'
What am I doing wrong?
Date FDI_net_in GDP
0 1982 4.300000e+08 2.050897e+11
1 1983 6.360000e+08 2.306867e+11
2 1984 1.258000e+09 2.599465e+11
3 1985 1.659000e+09 3.094880e+11
4 1986 1.875000e+09 3.007581e+11
Index(['Date', 'FDI_net_in', 'GDP '], dtype='object')
The error comes from these lines
x = pd.Data["GDP"].values()
y = pd.Data["FDI_net_in"].values()
You read the DataFrame with Data = pd.read_csv("data1.csv"), so to get the GDP column you access it on that variable, not on the pandas module, and .values is an attribute, not a method:
x = Data["GDP"].values
y = Data["FDI_net_in"].values
Try this:
Data.columns = Data.columns.str.strip(' ')  # remove stray spaces in column names (note 'GDP ' in the Index above)
x = Data["GDP"].values
y = Data["FDI_net_in"].values
Also, rename your script if it is called pandas.py or pd.py, as that can shadow the pandas library and cause errors like this.

pyspark: type object 'Row' has no attribute 'fromSeq'

I have the following code:
from pyspark.sql import Row
z1=["001",1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,30,41,42,43]
print z1
r1 = Row.fromSeq(z1)
print (r1)
Then I got this error:
AttributeError Traceback (most recent call last)
<ipython-input-6-fa5cf7d26ed0> in <module>()
2 z1=["001",1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,30,41,42,43]
3 print z1
----> 4 r1 = Row.fromSeq(z1)
5
6 print (r1)
AttributeError: type object 'Row' has no attribute 'fromSeq'
Anyone know what I might have missed? Thanks!
If you don't provide names, just use a tuple:
tuple(z1)
This is all that is needed to build a correct DataFrame.
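As a rough sketch of how that can be turned into a DataFrame (assuming Spark 2.x+ with an existing SparkSession; a shortened stand-in list is used here for brevity):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
z1 = ["001", 1, 2, 3, 4, 5]  # stand-in for the longer list in the question
# A plain tuple works as one row; Spark infers generic column names (_1, _2, ...).
df = spark.createDataFrame([tuple(z1)])
df.show()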

I ran dropna on a DataFrame but got an error message

I ran this statement dr=df.dropna(how='all') to remove missing values and got the error message shown below:
AttributeError Traceback (most recent call last)
<ipython-input-29-07367ab952bc> in <module>
----> 1 dr=df.dropna(how='all')
AttributeError: 'list' object has no attribute 'dropna'
According to the tabula-py docs (https://readthedocs.org/projects/tabula-py/downloads/pdf/latest/), with a call like
df = tabula.read_pdf(file, lattice=True, pages='all', area=(1, 1, 1000, 100), relative_area=True)
pages='all' probably returns a list of DataFrames.
So you have to loop over it:
for sub_df in df:
    dr = sub_df.dropna(how='all')
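If a single cleaned DataFrame is the goal, one possible extension (my assumption, not part of the original answer) is to drop the all-NaN rows per page and concatenate the pieces, assuming file points at the source PDF:
import pandas as pd
import tabula

# read_pdf with pages='all' returns one DataFrame per page/table.
dfs = tabula.read_pdf(file, lattice=True, pages='all',
                      area=(1, 1, 1000, 100), relative_area=True)
# Drop empty rows in each piece, then stack everything together.
cleaned = pd.concat([d.dropna(how='all') for d in dfs], ignore_index=True)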
