I expect the DataFrame to print out as an 'Excel'-like table, but instead I get the index error:
'IndexError: too many indices for array'
import numpy as np
import pandas as pd
from numpy.random import randn
rowi = ['A', 'B', 'C', 'D', 'E']
coli = ['W', 'X', 'Y', 'Z']
df = pd.DataFrame(randn[5, 4], rowi, coli) # data , index , col
print(df)
How do I solve the problem?
Is this what you want? randn is a function, so it has to be called with parentheses rather than indexed with square brackets:
df = pd.DataFrame(randn(5, 4), rowi, coli)
Out[583]:
W X Y Z
A -0.630006 -0.033165 -1.005409 -0.827504
B 0.044278 0.526636 1.082062 -1.664397
C 0.523847 -0.688798 -0.626712 0.149128
D 0.541975 -1.448316 -0.961484 -0.526547
E 0.066888 0.238089 1.180641 0.462298
I am a total noob in coding, but am currently working on some stuff just to play around in python - which is really cool!
Can you help me figure out a way to avoid the "ValueError: All arrays must be of the same length" in case some partial data (e.g. the price) is not available on a website I want to crawl? I can see that the data is missing by printing the lengths of the lists - for example they come back as 10, 11 and 11, which causes the error because the first list is missing a value. Everything else works just fine. For me it would be ideal if the missing entry could simply be filled with something like "-" or "Not available".
I have read a lot and gone through a lot of trial and error, so I would be really glad if someone could help me out. Here is my code:
#add parser
page_source = driver.page_source
soup = BeautifulSoup(driver.page_source, 'html.parser')
#add scrape info
images = []
for img in soup.findAll('div', {'class': 'gridContent'}):
    images.append(img.get('src'))
marke = [marke.text for marke in soup.findAll('span', {'class': 'ZZZ'})]
titel = [titel.text for titel in soup.findAll('h3', {'class': 'YYY'})]
preis = [preis.text for preis in soup.findAll('div', {'class': 'XXX'})]
#assign DF's
alle_daten = {'Zeitstempel:': timestamp_human, 'URL:': url, 'Marke:': marke, 'Titel:': titel, 'Preis:': preis}
df_all = pd.DataFrame(data=alle_daten)
df_scrape_all_clean = df_all.replace('\n', ' ',)
clean_stack = pd.concat([df_scrape_all_clean], axis=1)
df_all_urls = df_all_urls.append(df_all)
df_all_urls.to_excel("AAA.xlsx")
print(url)
I will walk you through a small case study to show why this error occurs and how to avoid it in the future.
Suppose we attempt to create the following pandas DataFrame:
import pandas as pd
#define arrays to use as columns in DataFrame
team = ['A', 'A', 'A', 'A', 'B', 'B', 'B']
position = ['G', 'G', 'F', 'F', 'G', 'G', 'F', 'F']
points = [5, 7, 7, 9, 12, 9, 9, 4]
#attempt to create DataFrame from arrays
df = pd.DataFrame({'team': team,
                   'position': position,
                   'points': points})
result:
ValueError: All arrays must be of the same length
We receive an error telling us that the arrays do not all have the same length.
We can verify this by printing the length of each array:
#print length of each array
print(len(team), len(position), len(points))
result:
7 8 8
We see that the ‘team’ array only has 7 elements while the ‘position’ and ‘points’ arrays each have 8 elements.
How to Fix the Error
The easiest way to address this error is to simply make sure that each array we use has the same length:
import pandas as pd
#define arrays to use as columns in DataFrame
team = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
position = ['G', 'G', 'F', 'F', 'G', 'G', 'F', 'F']
points = [5, 7, 7, 9, 12, 9, 9, 4]
#create DataFrame from arrays
df = pd.DataFrame({'team': team,
                   'position': position,
                   'points': points})
#view DataFrame
df
team position points
0 A G 5
1 A G 7
2 A F 7
3 A F 9
4 B G 12
5 B G 9
6 B F 9
7 B F 4
Notice that each array has the same length this time.
Thus, when we use the arrays to create the pandas DataFrame we don’t receive an error because each column has the same length.
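Coming back to the scraping question above, a minimal sketch of the "fill with 'Not available'" idea, assuming the value that is missing is the last one in its list (the sample data and the filler string are just placeholders): pad every list to a common length before building the DataFrame.
import pandas as pd
# Example lists as the scraper might return them; one price is missing
marke = ['BrandA', 'BrandB', 'BrandC']
titel = ['Item 1', 'Item 2', 'Item 3']
preis = ['10 EUR', '12 EUR']
# Pad every list to the length of the longest one
max_len = max(len(marke), len(titel), len(preis))
pad = lambda lst: lst + ['Not available'] * (max_len - len(lst))
df_all = pd.DataFrame({'Marke:': pad(marke),
                       'Titel:': pad(titel),
                       'Preis:': pad(preis)})
print(df_all)
Note that padding at the end only lines the rows up correctly if it really is the last value that is missing; if a price can be missing anywhere on the page, it is safer to scrape one container element per item so that each item's fields stay together.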
I would like to merge two data frames (how=left), not only on an index but also on a condition.
E.g assume two data frame
C1 C2
A = I 3
K 2
L 5
C1 C2 C3
B = I 5 T
I 0 U
K 1 X
L 7 Z
Now I would like to left outer join table A with B using index C1 under the condition that A.C2 > B.C2. That is, the final result should look like
A.C1 A.C2 B.C2 B.C3
A<-B = I 3 0 U
K 2 1 X
L 5 Null Null
P.S.: If you want to test it your self:
import pandas as pd
df_A = pd.DataFrame([], columns=['C 1', 'C2'])
df_A['C 1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
df_B = pd.DataFrame([], columns=['C1', 'C2', 'C3'])
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']
The quick and dirty solution would be to simply join on the C1 column and then put NULL or NaN into B's columns for all the rows where the condition A.C2 > B.C2 is not met.
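A minimal pandas-native sketch of that quick-and-dirty idea (the data values follow the tables at the top of the question, and the _A/_B suffixes are my own choice): merge on C1, keep only the pairs that satisfy A.C2 > B.C2, then left-join back onto A so unmatched rows like 'L' survive with NaN.
import pandas as pd
df_A = pd.DataFrame({'C1': ['I', 'K', 'L'], 'C2': [3, 2, 5]})
df_B = pd.DataFrame({'C1': ['I', 'I', 'K', 'L'],
                     'C2': [5, 0, 1, 7],
                     'C3': ['T', 'U', 'X', 'Z']})
# Merge on C1, then keep only the rows that satisfy the condition
merged = df_A.merge(df_B, on='C1', suffixes=('_A', '_B'))
matched = merged[merged['C2_A'] > merged['C2_B']]
# Left-join back onto df_A so rows without any match keep NaN in B's columns
result = df_A.merge(matched[['C1', 'C2_B', 'C3']], on='C1', how='left')
print(result)
If several B rows can satisfy the condition for the same A row, add a drop_duplicates('C1') or a groupby on C1 after the filter so that each row of A keeps exactly one match.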
Method: run an SQL query directly against the pandas DataFrames using the pandasql library (see the references below).
import pandas as pd
df_A = pd.DataFrame([], columns=['C1', 'C2'])
df_A['C1'] = ['I', 'K']
df_A['C2'] = [3, 2]
df_B = pd.DataFrame([], columns=['C1', 'C2', 'C3'])
df_B['C1'] = ['I', 'I', 'K']
df_B['C2'] = [5, 0, 2]
df_B['C3'] = ['T', 'U', 'X']
It appears to me that the condition you specified for the outer join (A.C1 = B.C1) does not produce the expected result. I needed to GROUP BY A.C1 in order to drop duplicate rows that have the same value of A.C1 after the join.
import pandasql as ps
q = """
SELECT A.C1 as 'A.C1',
A.C2 as 'A.C2',
B.C2 as 'B.C2',
B.C3 as 'B.C3'
FROM df_A AS A
LEFT OUTER JOIN df_B AS B
--ON A.C1 = B.C1 AND A.C2 = B.C2
WHERE A.C2 > B.C2
GROUP BY A.C1
"""
print(ps.sqldf(q, locals()))
Output
A.C1 A.C2 B.C2 B.C3
0 I 3 2 X
1 K 2 0 U
Other References
https://www.zentut.com/sql-tutorial/sql-outer-join/
Executing an SQL query over a pandas dataset
How to do a conditional join in python Pandas?
https://github.com/pandas-dev/pandas/issues/7480
https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
I found one non-pandas-native solution:
import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
cols = df_A.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df_A.columns = cols
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']
# df_merge = pd.merge(left=df_A, right=df_B, how='left', on='C1')
df_sql = pysqldf("""
select *
from df_A t_1
left join df_B t_2 on t_1.C1 = t_2.C1 and t_1.C2 >= t_2.C2
;
""")
However, for big tables, pandasql turns out to be less performant than native pandas operations.
Output:
C2 C1 C3 C2 C1
0 3 I U 0.0 I
1 2 K X 2.0 K
2 5 L None NaN None
I'm trying to change only certain values in a dataframe:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work: even though I'm only looking at the values in col1 that are 'a', the error says
KeyError: 'b'
implying that it also looks up the rows of col1 whose value is 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
The problem is that when you call np.where all of its parameters are evaluated first, and then the result is decided depending on the condition. So the dictionary is queried also for 'b' and 'c', even if those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since it will be discarded later it does not matter which value you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
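For completeness, a third variant along the same lines (not from the answers above, just a sketch): map col1 through the dictionary and fill the misses from the original column. The intermediate NaNs turn the column into floats, so cast back if you need integers.
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
# Rows whose col1 key is not in the dict become NaN and are filled from col2
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2']).astype(int)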
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColN']=['AAA', 'AAA', 'AAA', 'AAA', 'ABC']
df['ColN_dt']=['03-01-2018', '03-04-2018', '03-05-2018', \
'03-08-2018', '03-12-2018']
df['ColN_ext']=['A', 'B', 'B', 'B', 'B']
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
I am trying to solve the following problem based on the above DataFrame:
within a window of (say) 5 days, I want to check whether any ColN_ext value appears both before and after a particular row, within each ColN group.
i.e. I am trying to create a flag:
df['flag'] = [NaN, 0, 1, NaN, NaN]. Any help would be appreciated.
I was able to do this by defining a custom function:
import numpy as np
import pandas as pd
flag_list = []
def create_flag(dt, lookupdf):
    stdt = dt - lkfwd
    enddt = dt + lkfwd
    bckset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] >= stdt) &
                                  (lookupdf['ColN_dt'] < dt)]['ColN_ext'])
    fwdset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] > dt) &
                                  (lookupdf['ColN_dt'] <= enddt)]['ColN_ext'])
    flag_list.append(bool(bckset_ext.intersection(fwdset_ext)))
    return None
# Define the rolling days
lkfwd = pd.Timedelta(days=5)
df = pd.DataFrame()
df['ColN']=['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'ABC']
df['ColN_dt']=['03-12-2018', '03-13-2018', '03-13-2018', '03-01-2018', '03-05-2018', '03-04-2018', '03-08-2018', '02-04-2018']
df['ColN_ext']=['A', 'B', 'A', 'A', 'B', 'B', 'C', 'A']
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
dfs = df.sort_values(by=['ColN', 'ColN_dt']).reset_index(drop=True)
dfg = dfs.groupby('ColN')
for _, grpdf in dfg:
    grpdf['ColN_dt'].apply(create_flag, args=(grpdf,))
dfs['flag'] = flag_list
This generates:
dfs['flag'] = [False, False, False, True, False, False, False, False]
I am now trying to achieve the same using pandas.groupby + rolling + (maybe) resample.
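One way to get partway there is groupby.apply. The sketch below is not a true rolling window; it just repackages the custom function above per group (same assumption as before: the flag is True when any ColN_ext value occurs both in the 5 days before and the 5 days after a row), using the sorted dfs frame from the snippet above.
import pandas as pd
lkfwd = pd.Timedelta(days=5)
def flag_group(g):
    flags = []
    for dt in g['ColN_dt']:
        before = set(g.loc[(g['ColN_dt'] >= dt - lkfwd) & (g['ColN_dt'] < dt), 'ColN_ext'])
        after = set(g.loc[(g['ColN_dt'] > dt) & (g['ColN_dt'] <= dt + lkfwd), 'ColN_ext'])
        flags.append(bool(before & after))
    return pd.Series(flags, index=g.index)
# group_keys=False keeps the original index so the result aligns with dfs
dfs['flag'] = dfs.groupby('ColN', group_keys=False).apply(flag_group)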
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random(36).reshape((9, 4)), index=np.arange(1, 10), columns=['A', 'B', 'C', 'D'])
te = pd.DataFrame(data = np.random.randint(low=1, high=10, size=1000), columns=['Number_test'])
I want to join the two dataframes so that each value in te's Number_test column is matched with the row of df whose index equals that value.
Use pandas.DataFrame.merge:
pd.merge(df, te, left_index=True, right_on='Number_test')
or
pd.merge(df.reset_index(), te, left_on='index', right_on='Number_test')
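A tiny, hedged illustration of what the first form does, with deterministic stand-in data (column names follow the question): each row of te is paired with the row of df whose index equals its Number_test value.
import pandas as pd
df = pd.DataFrame({'A': [0.1, 0.2, 0.3]}, index=[1, 2, 3])
te = pd.DataFrame({'Number_test': [2, 1, 2, 3]})
# Every Number_test value is looked up in df's index and the matching df row is attached
out = pd.merge(df, te, left_index=True, right_on='Number_test')
print(out)
With the default how='inner', te rows whose Number_test value does not appear in df's index are dropped; use how='right' in this form if you want to keep every row of te.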