Issue importing parquet


#1

Hi,

I am getting errors trying to import the parquet library. Which library should I use?

Error:
import pyarrow.parquet as pq
RuntimeError: module compiled against API version a but this version of numpy is 7
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib64/python2.7/site-packages/pyarrow/__init__.py", line 32, in
from pyarrow.lib import cpu_count, set_cpu_count
File "pyarrow/lib.pyx", line 40, in init pyarrow.lib (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:97556)
ImportError: numpy.core.multiarray failed to import

Thanks !!!


#2

@juan_zapata,
First, you need to install pyarrow with pip or conda.
Then you can import it with:
import pyarrow.parquet as pq


#3

Thanks.

I mean, in the lab I want to save a dataframe as a parquet file and compress it, but I am getting errors. I cannot install any library in the lab.

Error trying to import library:

import pyarrow.parquet as pq
RuntimeError: module compiled against API version a but this version of numpy is 7
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib64/python2.7/site-packages/pyarrow/__init__.py", line 32, in
from pyarrow.lib import cpu_count, set_cpu_count
File "pyarrow/lib.pyx", line 40, in init pyarrow.lib (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:97556)
ImportError: numpy.core.multiarray failed to import

Error due to privileges on the environment:
pip install pyarrow
Requirement already satisfied: pyarrow in /usr/lib64/python2.7/site-packages (0.8.0)
Collecting futures (from pyarrow)
Using cached https://files.pythonhosted.org/packages/2d/99/b2c4e9d5a30f6471e410a146232b4118e697fa3ffc06d6a65efde84debd0/futures-3.2.0-py2-none-any.whl
Requirement already satisfied: six>=1.0.0 in /usr/lib/python2.7/site-packages (from pyarrow) (1.9.0)
Collecting numpy>=1.10 (from pyarrow)
Using cached https://files.pythonhosted.org/packages/85/51/ba4564ded90e093dbb6adfc3e21f99ae953d9ad56477e1b0d4a93bacf7d3/numpy-1.15.0-cp27-cp27mu-manylinux1_x86_64.whl
boto3 1.6.3 requires botocore<1.10.0,>=1.9.3, which is not installed.
boto3 1.6.3 requires jmespath<1.0.0,>=0.7.1, which is not installed.
boto3 1.6.3 requires s3transfer<0.2.0,>=0.1.10, which is not installed.
watchdog 0.8.3 requires PyYAML>=3.10, which is not installed.
ipapython 4.5.0 requires pyldap>=2.4.15, which is not installed.
oauth2client 4.1.2 requires rsa>=3.1.4, which is not installed.
mrjob 0.6.1 requires botocore>=1.6.0, which is not installed.
mrjob 0.6.1 requires PyYAML>=3.08, which is not installed.
rtslib-fb 2.1.63 has requirement pyudev>=0.16.1, but you’ll have pyudev 0.15 which is incompatible.
ipapython 4.5.0 has requirement dnspython>=1.15, but you’ll have dnspython 1.12.0 which is incompatible.
Installing collected packages: futures, numpy
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/lib/python2.7/site-packages/concurrent'
Consider using the --user option or check the permissions.
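
The pip output itself points at a possible workaround: a --user install goes into your home directory (~/.local) instead of the system site-packages, so it needs no root privileges. Whether the lab permits this is another question, but it may be worth a try:

```shell
# Install into the per-user site-packages (~/.local) rather than /usr/lib,
# avoiding the Errno 13 permission error on system directories.
pip install --user pyarrow

# Verify the import picks up the user-site copy.
python -c "import pyarrow.parquet as pq; print(pq.__name__)"
```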

I am able to save dataframe as parquet but this method does not support compression:

ordersJoined.write.parquet("/user/juan_zapata/problem1/result4a-gzip", mode='overwrite')

What should I do to save a file as a compressed parquet file ?

Thanks.


#4

@juan_zapata Launch pyspark and try the commands below (note that read is a property, so it is sqlContext.read.load(...), not sqlContext.read(...)):

orders = sqlContext.read.load("/public/retail_db/orders")
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
orders.write.format("parquet").save("/user/username/parquetdemo")