Win10怎么搭建Pyspark2.4.4+Pycharm开发环境
这篇文章主要介绍“Win10怎么搭建Pyspark2.4.4+Pycharm开发环境”,在日常操作中,相信很多人在Win10怎么搭建Pyspark2.4.4+Pycharm开发环境问题上存在疑惑,小编查阅了各式资料,整理出简单好用的操作方法,希望对大家解答”Win10怎么搭建Pyspark2.4.4+Pycharm开发环境”的疑惑有所帮助!接下来,请跟着小编一起来学习吧!
下载资源
hadoop3.0.0
spark-2.4.4-bin-without-hadoop
winutils下载(对应hadoop3.0.1的bin目录覆盖本地hadoop的bin目录)
jdk1.8(默认已按照配置)
conda/anaconda(默认已安装)
注意:cdh7.3.2的spark为2.4.0但是使用2.4.0本地pyspark有bug,下载的文件可能在第一次解压缩后,如未出现目录,则需要修改文件后缀为zip,再次解压缩
python环境(推荐cmd非powershell)
spark2.4.x不支持python3.7以上版本
conda create -n pyspark2.4 python=3.7activate pyspark2.4pip install py4jpip install psutil
pyspark安装方法(推荐一)
%SPARK_HOME%\python\pyspark目录复制到%CONDA_HOME%\pyspark2.4\Lib\site-packages下
pip install pyspark=2.4.4
配置环境变量(自行百度)
以下只是示例,根据实际情况修改,路径不要有空格,如果有使用mklink /J 软链接 目录路径
系统变量添加HADOOP_HOMEE:\bigdata\ENV\hadoop-3.0.0 SPARK_HOMEE:\bigdata\ENV\spark-2.4.4-bin-without-hadoop PYSPARK_PYTHONC:\Users\zakza\anaconda3\envs\pyspark2.4\python.exe PATH添加%HADOOP_HOME%\bin%SPARK_HOME%\bin
修改配置文件
配置一 %SPARK_HOME%\conf目录下新建spark-env.cmd文件,内容如下
FOR /F %%i IN ('hadoop classpath') DO @set SPARK_DIST_CLASSPATH=%%i
配置二 %SPARK_HOME%\conf\目录下新建log4j.properties文件,内容如下
## Licensed to the Apache Software Foundation (ASF) under one or more# contributor license agreements. See the NOTICE file distributed with# this work for additional information regarding copyright ownership.# The ASF licenses this file to You under the Apache License, Version 2.0# (the "License"); you may not use this file except in compliance with# the License. You may obtain a copy of the License at## http://www.apache.org/licenses/LICENSE-2.0## Unless required by applicable law or agreed to in writing, software# distributed under the License is distributed on an "AS IS" BASIS,# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.# See the License for the specific language governing permissions and# limitations under the License.# # Set everything to be logged to the consolelog4j.rootCategory=WARN, consolelog4j.appender.console=org.apache.log4j.ConsoleAppenderlog4j.appender.console.target=System.errlog4j.appender.console.layout=org.apache.log4j.PatternLayoutlog4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Set the default spark-shell log level to WARN. When running the spark-shell, the# log level for this class is used to overwrite the root logger's log level, so that# the user can have different defaults for the shell and regular Spark apps.log4j.logger.org.apache.spark.repl.Main=WARN # Settings to quiet third party logs that are too verboselog4j.logger.org.spark_project.jetty=WARNlog4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERRORlog4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFOlog4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFOlog4j.logger.org.apache.parquet=ERRORlog4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive supportlog4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATALlog4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
配置Pycharm
注意:配置好环境变量重启下电脑,不然可能存在pycharm无法加载系统环境变量的情况
wc.txt
hello hadoophadoop spark pythonflink storm sparkmaster slavefirstsecond thridkafka scikit-learnflumehive spark-streaminghbase
wordcount测试代码
from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext('local', 'WordCount') textFile = sc.textFile("wc.txt") wordCount = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey( lambda a, b: a + b) wordCount.foreach(print)
正常运行结果:
常见问题:
spark-shell报错Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
解决方法:见上述配置一
Pyspark报错ModuleNotFoundError: No module named 'resource'
解决方法:spark2.4.0存在的bug,使用spark2.4.4
Pyspark报错org.apache.spark.sparkexception: python worker failed to connect back
解决方法:环境变量未配置正确,检查是否遗漏,并检查pycharm的configuration的环境变量里面能够看到
其他
关于%SPARK_HOME%\python\lib下的py4j-0.10.7-class="lazy" data-src.zip,pyspark.zip(未配置运行正常),也可以尝试添加到项目
到此,关于“Win10怎么搭建Pyspark2.4.4+Pycharm开发环境”的学习就结束了,希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习,快去试试吧!若想继续学习更多相关知识,请继续关注编程网网站,小编会继续努力为大家带来更多实用的文章!
免责声明:
① 本站未注明“稿件来源”的信息均来自网络整理。其文字、图片和音视频稿件的所属权归原作者所有。本站收集整理出于非商业性的教育和科研之目的,并不意味着本站赞同其观点或证实其内容的真实性。仅作为临时的测试数据,供内部测试之用。本站并未授权任何人以任何方式主动获取本站任何信息。
② 本站未注明“稿件来源”的临时测试数据将在测试完成后最终做删除处理。有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341