Questions tagged [apache-spark]

Apache Spark is an open-source distributed data-processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

0
votes
0 answers
7 views

How to remove special characters from a dataframe using a UDF

I am a learner in Spark SQL. Could anyone please help with the scenario below? Package name: sparksql, class name: custommethod, method name: removespecialchar. Create a custom method in Scala which takes 1 ...
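
A minimal sketch of one way to do this, assuming the target is a single string column; the column name "value" and the character rule are placeholder assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CustomMethod {
  // Placeholder rule: strip everything except letters, digits, and spaces.
  val removeSpecialChar = udf((s: String) =>
    Option(s).map(_.replaceAll("[^a-zA-Z0-9 ]", "")).orNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("udf-demo").getOrCreate()
    import spark.implicits._

    val df = Seq("ab@c!", "x#y z").toDF("value")
    df.withColumn("cleaned", removeSpecialChar(col("value"))).show()
  }
}
```

Note that the built-in regexp_replace does the same thing without a UDF and lets Spark optimize the expression.
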
0
votes
0 answers
6 views

Problem creating a connection with a remote Spark master

I am facing a problem with my Java program: it is not able to make a connection with the master node of a standalone Spark cluster. I am trying to get Spark to read data from a Kafka topic. I have ...
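
A common cause is a master URL or driver address the cluster cannot resolve. A minimal sketch of the session setup with hypothetical host names; spark.driver.host usually has to be an address the master and workers can reach back to:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-reader")
  .master("spark://master-host:7077")          // placeholder standalone master URL
  .config("spark.driver.host", "driver-host")  // address reachable from the cluster
  .getOrCreate()
```
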
0
votes
0 answers
7 views

Spark event logs for a long-running process

I have a long-running Spark process and can see that the generated event logs take up huge space. When I investigated the logs, I could see that, apart from the regular job, stage, and task event logs, there are several logs like ...
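
If the cluster runs Spark 3.0 or later, rolling event logs can cap the growth for long-running jobs; a sketch of the relevant configuration (the size value is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.rolling.enabled", "true")     // Spark 3.0+: split the log into files
  .config("spark.eventLog.rolling.maxFileSize", "128m") // roll over past this size
  .getOrCreate()
```
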
0
votes
0 answers
6 views

Ranking in Spark Structured Streaming

I have a data stream that consists of data like this: {Student, Class, CurrentScore}. I want to use a sliding window to calculate statistics over these events: spark.readStream(...).withColumn("...
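
A sketch of a sliding-window aggregation over such a stream; the rate source and derived columns below only stand in for the real {Student, Class, CurrentScore} events. Ranking functions such as rank() are not supported directly on streaming DataFrames, so one common pattern is to aggregate per window and sort the output in complete mode:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("window-stats").getOrCreate()

// Synthetic stand-in for the real event stream.
val scores = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .withColumn("Class", concat(lit("class-"), (col("value") % 3).cast("string")))
  .withColumn("CurrentScore", (col("value") % 100).cast("double"))

// 10-minute windows sliding every 5 minutes, keyed by Class.
val stats = scores
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("Class"))
  .agg(avg("CurrentScore").as("avgScore"), max("CurrentScore").as("maxScore"))
```

From here, stats would be attached to a writeStream sink; sorting the aggregate requires complete output mode.
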
1
vote
1 answer
18 views

In PySpark, using regexp_replace, how to replace a group with value from another column? [duplicate]

I have a dataframe with two columns: filename and year. I want to replace the year value in filename with the value from the year column. The third column in the table below demonstrates the requirement: +---------...
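
Since Spark 2.1, regexp_replace has an overload that takes Columns, so the replacement can come from the year column. A Scala sketch; the four-digit pattern is an assumption about how the year appears in filename:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("regexp-demo").getOrCreate()
import spark.implicits._

val df = Seq(("report_1999.csv", "2020")).toDF("filename", "year")

df.withColumn("filename", regexp_replace($"filename", lit("[0-9]{4}"), $"year"))
  .show(false)   // report_1999.csv -> report_2020.csv
```

In PySpark the same effect is reachable via expr("regexp_replace(filename, '[0-9]{4}', year)").
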
1
vote
2 answers
23 views

flatten Options inside Spark RDD

I'm trying to reproduce the behaviour of vanilla Scala collections' flatten with Option on a Spark RDD. For example: Seq(Some(1), None, Some(2), None, None).flatten > Seq[Int] = List(1, 2) // None ...
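
RDD has no flatten, but flatMap together with the standard Option-to-Iterable conversion drops the Nones and gives the same result; a small sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("flatten-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq(Some(1), None, Some(2), None, None))

// Option is implicitly viewable as an Iterable, so flatMap keeps the
// Some values and discards the Nones.
val flat = rdd.flatMap(o => o)
flat.collect()   // Array(1, 2)
```
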
0
votes
1 answer
22 views

Spark - Drop null values from map column

I'm using Spark to read a CSV file and then gather all the fields to create a map. Some of the fields are empty and I'd like to remove them from the map. So for a CSV that looks like this: "...
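
Assuming Spark 3.0+ and a map column, map_filter can drop the unwanted entries; the column name and the empty-string check below are assumptions about what "empty" means here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("map-filter-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, Map("a" -> "x", "b" -> null, "c" -> ""))).toDF("id", "fields")

// Keep only entries whose value is non-null and non-empty.
df.withColumn("fields", map_filter($"fields", (_, v) => v.isNotNull && v =!= ""))
  .show(false)   // {a -> x}
```
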
1
vote
0 answers
13 views

Efficient filtering of a large dataset in Spark based on other dataset

I have an issue with filtering a large dataset, which I believe is because of an inefficient join. What I'm trying to do is the following: Dataset info contains a lot of user data, identified by a ...
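
If the second dataset is only used as a filter, a left-semi join avoids carrying its columns through the plan, and broadcasting it skips shuffling the large side; a sketch with placeholder names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("semi-join-demo").getOrCreate()
import spark.implicits._

// "info" stands in for the large dataset, "keep" for the small filter set.
val info = Seq((1, "a"), (2, "b"), (3, "c")).toDF("userId", "payload")
val keep = Seq(1, 3).toDF("userId")

val filtered = info.join(broadcast(keep), Seq("userId"), "left_semi")
filtered.show()   // rows for userId 1 and 3 only
```
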
0
votes
0 answers
14 views

Is there an efficient way in Spark to query an API?

In PySpark, I have a dataframe with a bit more than 4 million rows. I add a column to the dataframe with the withColumn function. The value of the column for each row is defined in a UDF. The UDF ...
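
One common alternative to a per-row UDF is mapPartitions, so a single HTTP client serves an entire partition instead of being re-created per row. A Scala sketch assuming Java 11's HttpClient; the endpoint, the dataframe name df, and the response handling are placeholders:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// df stands in for the question's 4M-row dataframe, keyed by a string id.
val enriched = df.rdd.mapPartitions { rows =>
  val client = HttpClient.newHttpClient()   // one client per partition
  rows.map { row =>
    val id = row.getString(0)
    val req = HttpRequest.newBuilder(URI.create(s"https://api.example.com/v1/$id")).build()
    val resp = client.send(req, HttpResponse.BodyHandlers.ofString())
    (id, resp.body())                       // id -> raw API response
  }
}
```

Batching several ids per request, where the API allows it, usually helps more than any Spark-side tuning.
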
0
votes
0 answers
11 views

Spark task fails to write rows into ORC table

I run the following code for a spatial join on geometry fields: val coverage = DimCoverageReader.apply(spark, params) coverage.createOrReplaceTempView("dim_coverage") val ...
-1
votes
0 answers
19 views

How to remove the last occurrence in a Spark dataframe column

I have the below value in a column. How do I remove the last occurrence in a Scala dataframe column? Input: test\data\spark\test.csv Output: test\data\spark
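
One way is a regex anchored to the end of the string; a sketch (the doubled escaping is because both Scala string literals and regexes escape the backslash):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("trim-last-segment").getOrCreate()
import spark.implicits._

val df = Seq("""test\data\spark\test.csv""").toDF("path")

// Remove the last backslash and everything after it.
df.withColumn("dir", regexp_replace($"path", "\\\\[^\\\\]*$", ""))
  .show(false)   // test\data\spark
```
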
0
votes
0 answers
12 views

pyspark joinWithCassandraTable refactor without maps

I'm new to using Spark/Scala and I'm having trouble with a refactor of some of my code. I'm running Scala 2.11, using pyspark, in a Spark/YARN setup. The following is working, but I'd like to ...
0
votes
0 answers
18 views

How to mock a function in Scala?

object ReadUtils { def readData(sqlContext: SQLContext, fileType: FileType.Value): List[DataFrame] = { //some logic } I am writing a test for the execute function. import com.utils....
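
Scala objects are hard to mock directly, so one common workaround is to put readData behind a trait and inject it, letting the test substitute a stub. A sketch using the question's names, with FileType simplified to String:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

trait DataReader {
  def readData(sqlContext: SQLContext, fileType: String): List[DataFrame]
}

object ReadUtils extends DataReader {
  def readData(sqlContext: SQLContext, fileType: String): List[DataFrame] =
    ??? // the real logic from the question goes here
}

// execute depends on the trait, not the object, so a test can pass a
// stub DataReader that returns canned DataFrames.
class Executor(reader: DataReader = ReadUtils) {
  def execute(sqlContext: SQLContext): Unit = {
    val frames = reader.readData(sqlContext, "csv")
    // ...
  }
}
```
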
0
votes
0 answers
20 views

Facing an issue while using Spark-BigQuery-Connector with Java

I am able to read data from a BigQuery table via the Spark BigQuery connector locally, but when I deploy this in Google Cloud and run it via Dataproc, I get the below exception. If you see the ...
0
votes
1 answer
28 views

Create a Hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala dataframe. The Hive table should have underlying files in ORC format in an S3 location, partitioned by date. Here is what I have so far: I write the ...
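
A sketch of one way to do it, assuming the session has Hive support enabled; the bucket, database, and table names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("hive-orc-demo")
  .enableHiveSupport()   // required for saveAsTable into the Hive metastore
  .getOrCreate()
import spark.implicits._

// Placeholder data standing in for the question's dataframe.
val df = Seq(("a", "2021-01-01"), ("b", "2021-01-02")).toDF("value", "date")

df.write
  .format("orc")
  .partitionBy("date")                                  // one directory per date
  .option("path", "s3://my-bucket/warehouse/my_table")  // external S3 location
  .mode("overwrite")
  .saveAsTable("my_db.my_table")
```
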
