Categories: json, apache, hadoop, hive

How to store a Hive query result in JSON format in a file?


I want to store the result of a Hive query in a file in JSON format. With the Brickhouse jar I can get the query output as JSON, but I am unable to store it in a file or table. The query I'm trying is given below. When the INSERT OVERWRITE query runs, it fails with the error shown after it. How can I solve this error? Is there a way to store query results in JSON format using queries alone?


add jar hdfs:///mydir/brickhouse-0.7.1.jar;

INSERT OVERWRITE DIRECTORY '/mydir/textfile1'
STORED AS TEXTFILE
SELECT to_json(named_struct("id", id, "name", name))
FROM link_tbl;


INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: INSERT OVERWRITE DIRECTORY '/mydir/
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1571318954298_0001)
INFO : Map 1: -/-
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1571318954298_0001_1_00, diagnostics=[Vertex vertex_1571318954298_0001_1_00 [Map 1] killed/failed due to:INIT_FAILURE, Fail to create InputInitializerManager, org.apache.tez.dag.api.TezReflectionException: Unable to instantiate class with 1 arguments: org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
  at org.apache.tez.common.ReflectionUtils.getNewInstance( at org.apache.tez.common.ReflectionUtils.createClazzInstance( at$ at$ at Method) at at at at at at$4300( at$InitTransition.handleInitEvent( at$InitTransition.transition( at$InitTransition.transition( at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition( at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition( at org.apache.hadoop.yarn.state.StateMachineFactory.access$300( at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition( at org.apache.tez.state.StateMachineTez.doTransition( at at at$VertexEventDispatcher.handle( at$VertexEventDispatcher.handle( at org.apache.tez.common.AsyncDispatcher.dispatch( at org.apache.tez.common.AsyncDispatcher$ at
Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance( at sun.reflect.DelegatingConstructorAccessorImpl.newInstance( at java.lang.reflect.Constructor.newInstance( at org.apache.tez.common.ReflectionUtils.getNewInstance( ... 25 more
Caused by: java.lang.RuntimeException: Failed to load plan: hdfs:// java.lang.IndexOutOfBoundsException: Index: 19963874, Size: 113
Serialization trace:
_mainHash (org.codehaus.jackson.sym.BytesToNameCanonicalizer)
_rootByteSymbols (org.codehaus.jackson.JsonFactory)
jsonFactory (brickhouse.udf.json.ToJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
  at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork( at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork( at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.<init>( ... 30 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 19963874, Size: 113
Serialization trace:
_mainHash (org.codehaus.jackson.sym.BytesToNameCanonicalizer)
_rootByteSymbols (org.codehaus.jackson.JsonFactory)
jsonFactory (brickhouse.udf.json.ToJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
  at at at at at at at at at at at at at at at at at at at at at at at at at at at at at at at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo( at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan( at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan( at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork( ... 32 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 19963874, Size: 113
  at java.util.ArrayList.rangeCheck( at java.util.ArrayList.get( at at at at ... 65 more ]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

All answers to this question, which has the identifier 58434042

The best answer:

One solution is to create a table on top of this directory and let the JSON SerDe produce the JSON output.

Create table:

CREATE EXTERNAL TABLE mydirectory_tbl(
  id   string,
  name string
)
ROW FORMAT SERDE ''
LOCATION '/mydir' --this is HDFS/S3 location
;

Insert data:

INSERT OVERWRITE TABLE mydirectory_tbl
SELECT id, name
FROM link_tbl;
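
The SerDe class name is empty in the CREATE TABLE statement above, so here is a minimal end-to-end sketch with one filled in. It assumes the HCatalog JSON SerDe (org.apache.hive.hcatalog.data.JsonSerDe), which on some installations may first need the hive-hcatalog-core jar added to the session. The table name link_json_tbl and the directory /mydir/json_out are hypothetical, chosen only for illustration; the original answer may have used a different SerDe class.

```sql
-- Assumption: the HCatalog JSON SerDe is available on the classpath.
-- If it is not, the hive-hcatalog-core jar may need to be added with ADD JAR first.

CREATE EXTERNAL TABLE link_json_tbl (       -- hypothetical table name
  id   string,
  name string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'  -- assumed SerDe class
STORED AS TEXTFILE
LOCATION '/mydir/json_out';                 -- hypothetical dedicated directory

-- The SerDe serializes each row; no to_json() call is needed in the query.
INSERT OVERWRITE TABLE link_json_tbl
SELECT id, name
FROM link_tbl;
```

With this approach each row should be written as one JSON object per line in the files under the table location. Because the query no longer calls to_json, the Brickhouse ToJsonUDF that appears in the serialization trace of the error does not have to be shipped as part of the query plan.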

Note that you cannot specify a file name in place of a table or directory location, only a directory. If you want a single file, you can concatenate the files afterwards (preferable, as it performs better) or force a single reducer, for example by adding ORDER BY id.
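
A sketch of the single-reducer variant, reusing the table names from this answer. ORDER BY asks for a total order, which Hive normally computes in one final reducer, so the target directory would usually end up with a single output file. This is expected behavior rather than a guarantee, so check the directory contents on your Hive version.

```sql
-- Global ordering normally forces one final reducer, hence one output file.
INSERT OVERWRITE TABLE mydirectory_tbl
SELECT id, name
FROM link_tbl
ORDER BY id;
```

Sorting the whole result through one reducer can be slow on large tables, which is why concatenating the files afterwards is the preferred option.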
