Apache Hadoop

From the Open Wiki - Huihoo
[[Image:Carina-Pillar-680x100.jpg|Carina Nebula|right]]

==News==
{{SeeWikipedia}}
*[http://www.infoq.com/cn/articles/review-and-prospec-of-big-data-technology A review of and outlook for big data technology, written for Hadoop's tenth anniversary]
<rss>http://developer.yahoo.com/blogs/hadoop/feed/rss2/|short|date|max=5</rss>
<rss>http://www.cloudera.com/feed/|short|date|max=5</rss>
==Introduction==

Hadoop has become the new core of the analytical enterprise.

Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop lets enterprises quickly derive insight from massive volumes of structured and unstructured data.
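As a rough, non-Hadoop illustration of this model, the same divide, process and aggregate idea can be sketched in plain Python (the data and function names here are invented for the example):

```python
from collections import Counter
from functools import reduce

# Hypothetical records, standing in for file splits stored across nodes.
splits = [
    ["big data", "hadoop hdfs"],
    ["hadoop mapreduce", "big data hadoop"],
]

def map_split(split):
    """Map phase: each split is processed independently (in Hadoop, on the
    node that holds the data), emitting per-split word counts."""
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    """Reduce phase: merge partial counts into one global result."""
    a.update(b)
    return a

partials = [map_split(s) for s in splits]      # on a cluster these run in parallel
total = reduce(reduce_counts, partials, Counter())
print(total["hadoop"])  # → 3
```

On a real cluster the map phase runs where the data lives; the sketch only shows the data flow.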
==12 facts about Hadoop==

[http://www.searchbi.com.cn/showcontent_62856.htm An analyst's 12 facts about Hadoop]

*Fact 1: Hadoop consists of multiple products.
*Fact 2: Apache Hadoop is open-source technology, but proprietary vendors also offer Hadoop products.
*Fact 3: Hadoop is an ecosystem, not a single product.
*Fact 4: HDFS is a file system, not a database management system.
*Fact 5: Hive resembles SQL but is not standard SQL.
*Fact 6: Hadoop and MapReduce are related but do not depend on each other.
*Fact 7: MapReduce provides control over analysis, not the analysis itself.
*Fact 8: Hadoop is about data variety, not just data volume.
*Fact 9: Hadoop complements the data warehouse; it is not a replacement for it.
*Fact 10: Hadoop is more than web analytics.
*Fact 11: Big data does not necessarily require Hadoop.
*Fact 12: Hadoop is no free lunch.

==Ecosystem Projects==

*[https://github.com/onurakpolat/awesome-bigdata Awesome Big Data] [[image:awesome.png]]
*[https://github.com/youngwookim/awesome-hadoop Awesome Hadoop] [[image:awesome.png]]
*[[Apache Avro]] data serialization
*[[Apache Cassandra]] database
*[[Apache Chukwa]] data collection system
*[[Apache HBase]] database
*[[Apache Hive]] data warehouse
*[[Apache Mahout]] machine learning and data mining
*[[Apache Pig]] dataflow language
*[[Apache ZooKeeper]] distributed coordination service
  
 
[[文件:hadoop-logo.gif|right]]
==Tutorial==

*[http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python Writing An Hadoop MapReduce Program In Python]
*[http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant Python + Hadoop = Flying Circus Elephant]
*[[Dumbo]]
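In the spirit of the Python tutorials listed above, a Hadoop Streaming job is just a mapper and a reducer that read stdin and write tab-separated key/value lines. A minimal word-count sketch (illustrative only; in a real job the mapper and reducer would be two separate scripts passed to the streaming jar):

```python
import sys
from itertools import groupby

def mapper(lines):
    """mapper.py: emit 'word<TAB>1' for every word on stdin."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """reducer.py: input arrives sorted by key, so counts can be
    summed one group at a time without holding everything in memory."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

if __name__ == "__main__":
    # The framework's shuffle/sort can be simulated locally with:
    #   cat input.txt | python mapper.py | sort | python reducer.py
    mapped = sorted(mapper(["hadoop is as hadoop does"]))
    print("\n".join(reducer(mapped)))
```

The `sorted()` call stands in for Hadoop's shuffle phase, which guarantees the reducer sees each key's values contiguously.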

==Projects==

*[http://cloudstory.in/2012/04/hive-for-retail-analysis/ Hive for Retail Analysis]
 
==Developer==

*[[IBM MapReduce Tools for Eclipse]]
 
==Yahoo==

[[Image:hadoop-yahoo-distribution.gif|right|Yahoo! Distribution of Hadoop]]

*[http://yahoohadoop.tumblr.com/ Hadoop at Yahoo]
*Yahoo! Distribution of Hadoop http://developer.yahoo.com/hadoop/
 
==Powered By==

[[Image:Yahoo-Hadoop-Clusters.png|right|thumb|Yahoo Hadoop Clusters]]

*http://www.yahoo.com/
More than 5000 nodes running Hadoop as of July 2007; the biggest cluster has 2000 nodes (2×4-CPU boxes with 3 TB of disk each). Used to support research for ad systems and web search, and for scaling tests to support development of Hadoop on larger clusters.
*[http://www.csdn.net/article/2015-10-27/2826054 eBay's Connected Commerce big data platform in practice]
*http://www.last.fm/
25-node cluster (dual Xeon LV, 1 TB/node storage), used for chart calculation and web-log analysis.
 
*[[Cascading]]
*[http://www-03.ibm.com/press/us/en/pressrelease/22613.wss Blue Cloud Computing Clusters]
*[http://www.infoq.com/cn/articles/hadoop-ten-years-part01 Interview with Alibaba's Wang Feng: unifying streaming and batch in the Hadoop ecosystem's next-generation compute engine]
*[http://www.koubei.com/ Koubei.com]
Using Hadoop to process Apache logs, analyzing user actions, click flow, the links clicked on any specified page of the site, and more. Also using Hadoop map/reduce to process all the price data that users enter.

More: http://wiki.apache.org/hadoop/PoweredBy

==Distributions==

*[[Hortonworks]] Data Platform (HDP)
*[[Cloudera]]
*[[Mapr]]
*[http://www.intel.cn/content/www/cn/zh/big-data/intel-distribution-of-hadoop.html Intel Distribution for Apache Hadoop]
*[[Apache Bigtop]] builds Hadoop distributions

==Documentation==

*[http://docs.huihoo.com/apache/apachecon/us2015/Apache-Mesos+Apache-YARN=Myriad.pdf Apache Mesos + Apache YARN = Myriad]
*[http://docs.huihoo.com/oreilly/conferences/strataconf/big-data-conference-ny-2014/From-Oracle-to-Hadoop.pptx From Oracle to Hadoop]
*[http://docs.huihoo.com/apache/apachecon/us2014/Apache-Hadoop-YARN-The-Next-generation-Distributed-Operating-System.pdf Apache Hadoop YARN: The Next Generation Distributed Operating System]
*[http://docs.huihoo.com/javaone/2014/BOF3725-Text-Processing-with-Hadoop-and-Mahout-Key-Concepts-for-Distributed-NLP.pdf Text Processing with Hadoop and Mahout: Key Concepts for Distributed NLP]
*[http://docs.huihoo.com/oracle/openworld/2014/THT11268-Hadoop-2-Cluster-with-Oracle-Solaris-Zones-ZFS-and-Unified-Archives.pptx Hadoop 2 Cluster with Oracle Solaris Zones, ZFS, and Unified Archives]
*[http://docs.huihoo.com/oreilly/conferences/strataconf/big-data-conference-ny-2013/An-Introduction-to-Real-Time-Analytics-with-Cassandra-and-Hadoop.pdf An Introduction to Real-Time Analytics with Cassandra and Hadoop]
 
==Gallery==

<gallery widths=100px heights=100px perrow=6>
image:Features-of-Hadoop-3.0.png|Hadoop 3.0
image:big-data.jpg|Big data
Image:apache-hadoop-architecture.gif|Architecture
Image:apache-hadoop-ecosystem.png|Ecosystem
Image:hdfsarchitecture.png|HDFS Architecture
Image:hdfsdatanodes.png|Data Replication
image:hadoop-modern-data-architecture.png|Hadoop data architecture
image:apache-hadoop-yarn.png|YARN-centric architecture
image:hadoop-docker.png|Docker images
image:openstack-docker-hadoop.png|Integration with OpenStack and Docker
image:Spark-and-Map-Reduce-Differences.png|Spark and MapReduce
</gallery>
  
==Business==

*[http://en.community.dell.com/dell-blogs/direct2dell/b/direct2dell/archive/2011/08/04/introducing-the-dell-cloudera-solution-for-hadoop-harnessing-the-power-of-big-data.aspx Dell begins selling a packaged Apache Hadoop solution], based on the Cloudera distribution and running on Dell PowerEdge C2100 servers and Dell PowerConnect 6248 switches.
*[http://hortonworks.com/ Hortonworks Data Platform]

==Blogs==

*[https://developer.yahoo.com/blogs/hadoop/ Yahoo Hadoop Blog]
*[http://zh.hortonworks.com/blog/ Hortonworks Blog]
*[http://blog.cloudera.com/ Cloudera Blog]
*[http://dongxicheng.org/ Dong's Blog]
 
==Links==

[[文件:mining-of-massive-datasets.jpg|right|thumb|http://book.huihoo.com/mining-of-massive-datasets/ Download the book]]

*The University of Washington also started a Hadoop-based [http://docs.huihoo.com/mapreduce/ distributed computing course] around that time
*[http://hadoop.apache.org/ Hadoop official site]
*http://wiki.apache.org/hadoop/
*[https://amplab.cs.berkeley.edu/ Berkeley AMPLab]
*[http://developer.yahoo.net/blog/archives/2007/07/yahoo-hadoop.html Yahoo's Hadoop Support]
*[http://www.infoq.com/cn/news/2007/08/hadoop-momentum Hadoop, an open-source project with a Google-like architecture, gains community attention]
*[http://www.infoq.com/news/2006/11/hadoop-ec2 Run Your Own Google Style Computing Cluster with Hadoop and Amazon EC2]
*[http://docs.huihoo.com/apache/hadoop/ Hadoop documentation]
*[http://docs.huihoo.com/apache/hadoop/1.0.4/cn/index.html Hadoop documentation (Chinese)]
*[http://download.huihoo.com/apache/hadoop/ Hadoop downloads]
  
{{Comment}}

[[category:hadoop]]
[[category:big data]]
[[category:search engine]]
[[category:distributed computing]]
[[category:apache]]
[[category:yahoo]]
[[category:hortonworks]]
[[category:cloudera]]
Revision as of 12:46, 28 December 2017


Apache Hadoop is a software platform that makes it easy to develop and run applications that process massive amounts of data. Hadoop is an open-source implementation of MapReduce and uses the Hadoop Distributed File System (HDFS). MapReduce splits an application into many small blocks of work. For reliability, HDFS creates multiple replicas of each data block and places them on compute nodes across the cluster; MapReduce then processes the data where the replicas are stored.

For a large file, Hadoop splits it into blocks of 64 MB each. These blocks are stored as ordinary files on the various nodes. By default each block has 3 replicas, which is how data safety is achieved: even if a machine goes down, the system detects the failure and automatically selects a new node to hold another copy.
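A back-of-the-envelope sketch of the splitting and replication just described, using the defaults quoted above (the round-robin placement is a toy simplification, not HDFS's actual rack-aware policy):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size, as stated above
REPLICATION = 3                  # default replica count

def split_into_blocks(file_size):
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    return math.ceil(file_size / BLOCK_SIZE)

def place_replicas(n_blocks, nodes):
    """Toy round-robin placement: each block gets REPLICATION distinct nodes."""
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + i) % len(nodes)] for i in range(REPLICATION)]
    return placement

n = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
print(n)                                  # → 4 (three full blocks plus one partial)
print(place_replicas(n, ["dn1", "dn2", "dn3", "dn4", "dn5"])[0])
```

Losing any one node still leaves two live replicas of every block, which is what gives the NameNode time to re-replicate.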

In Hadoop there is one master node and multiple data nodes. For operations such as queries, a client only needs to talk to the master node (commonly called the metadata server) to obtain the file metadata it needs, and then communicates with the data nodes to transfer the actual data.

When the master starts up (for example after going down), it rebuilds the file system tree by replaying the earlier operations. Because the tree is held directly in memory, queries against it are very efficient.
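The replay idea can be sketched as a tiny operation log: the in-memory namespace is rebuilt by re-applying logged operations (a deliberate simplification of the real NameNode's fsimage and edit log):

```python
def apply_op(tree, op):
    """Apply one logged namespace operation to the in-memory 'tree' (a dict of paths)."""
    kind, path = op
    if kind == "mkdir":
        tree[path] = "dir"
    elif kind == "create":
        tree[path] = "file"
    elif kind == "delete":
        tree.pop(path, None)
    return tree

def replay(log):
    """Rebuild the namespace from scratch by replaying the operation log."""
    tree = {}
    for op in log:
        apply_op(tree, op)
    return tree

log = [("mkdir", "/user"), ("create", "/user/a.txt"),
       ("create", "/user/b.txt"), ("delete", "/user/a.txt")]
print(replay(log))  # → {'/user': 'dir', '/user/b.txt': 'file'}
```

Because the rebuilt tree lives entirely in memory, lookups after replay are fast, at the cost of a startup pass over the log.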

Core: Hadoop Distributed File System

HBase: Bigtable-like structured storage for Hadoop HDFS

Hadoop On Demand


Quick Start

$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
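To see what this example job computes, the same logic can be run locally in plain Python: count every match of the regex `dfs[a-z.]+` across the input (here fed from in-memory strings rather than HDFS files):

```python
import re
from collections import Counter

def grep_counts(texts, pattern=r"dfs[a-z.]+"):
    """Count every match of `pattern` across the inputs, as the
    Hadoop grep example does over its input directory."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(pattern, text))
    return counts

# The conf/*.xml inputs are full of property names like these:
sample = ["<name>dfs.replication</name>", "<value>dfs.data.dir</value>"]
print(grep_counts(sample))
```

The Hadoop job produces the same information, only computed as a distributed map (match) and reduce (count) over HDFS.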

Use the following conf/hadoop-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Now check that you can ssh to the localhost without a passphrase:

$ ssh localhost 

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Format a new distributed-filesystem:

$ bin/hadoop namenode -format
....
08/03/19 11:15:41 INFO dfs.Storage: Storage directory /tmp/hadoop-allen/dfs/name has been successfully formatted.

Start The hadoop daemons:

$ bin/start-all.sh 

Browse the web-interface for the NameNode and the JobTracker, by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed filesystem:

$ bin/hadoop dfs -put conf input

Run some of the examples provided:

$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop dfs -get output output
$ cat output/*

or View the output files on the distributed filesystem:

$ bin/hadoop dfs -cat output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh


Powered By


We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. We currently have around a hundred machines (low-end commodity boxes with about 1.5 TB of storage each). Our data sets are currently on the order of tens of TB, and we routinely process multiple TB of data every day. We are in the process of adding a 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage; each (commodity) node will have 8 cores and 4 TB of storage. We are heavy users of both streaming and the Java APIs. We have built a higher-level data warehousing framework using these features (which we will open-source at some point), and we have also written a read-only FUSE implementation over HDFS.


up to 400 instances on Amazon EC2, data storage in Amazon S3


