Sunday, May 22, 2016

HDFS Important Commands



List a directory on the local filesystem:

hdfs dfs -ls file:///bin

List a directory in HDFS:

hdfs dfs -ls hdfs:///root

The URI scheme selects the filesystem; besides hdfs://, schemes such as file://, swift://, or s3:// can be used.

Display help for the HDFS shell commands:

hdfs dfs -help

Create a directory:

hdfs dfs -mkdir /user/steve/dir1

List the contents of a directory recursively:

hdfs dfs -ls -R /user/steve

Copy a file from the local filesystem to HDFS:

hdfs dfs -put <local_source> <hdfs_destination>

Append local file content to a file in HDFS:

hdfs dfs -appendToFile <local_source> <hdfs_file>

Display the contents of a file:

hdfs dfs -cat fileA

View only the last 1 KB of a file:

hdfs dfs -tail fileB

Copy a file from HDFS to the local filesystem:

hdfs dfs -get /user/steve/fileB /home/steve/fileB

Move (rename) a file within HDFS:

hdfs dfs -mv /user/steve/fileC /user/steve/dir1/fileC



Merge the contents of an HDFS path into a single file on the local filesystem:

hdfs dfs -getmerge fileF fileG



Delete files:

hdfs dfs -rm fileB fileC



Delete an empty directory:

hdfs dfs -rmdir /user/steve/dir1/dir2



Trash is configured by two properties, fs.trash.interval (how many minutes a deleted file is retained in trash) and fs.trash.checkpoint.interval (how often trash checkpoints are created); their defaults live in core-default.xml and are overridden in core-site.xml.
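When trash is enabled, hdfs dfs -rm moves files into the user's .Trash directory rather than deleting them. To bypass trash and delete immediately:

hdfs dfs -rm -skipTrash fileB

To force an immediate trash checkpoint:

hdfs dfs -expunge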



Override the block size and replication factor when writing a file, using the -D generic option:

hdfs dfs -D dfs.blocksize=<N> -D dfs.replication=5 -put <local_source> <hdfs_destination>
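For example, a hypothetical upload with a 128 MB block size and five replicas (the paths are illustrative):

hdfs dfs -D dfs.blocksize=134217728 -D dfs.replication=5 -put /home/steve/data.log /user/steve/data.log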



Change the owner of a file:

hdfs dfs -chown danielle /data/weblogs/fileA



Change the group of a file:

hdfs dfs -chgrp hdfs fileB



Change owner and group membership simultaneously:

hdfs dfs -chown hcat:hdfs /data/weblogs/fileC

The command requires HDFS superuser privileges.

Using setfacl on Files

Set (remove existing and replace) both permissions and ACL entries using a single command:

$ hdfs dfs -setfacl --set user::rw-,group::r--,other::---,user:steve:rw-,user:jason:rw- fileA

This sets the owner, group, and other permissions and adds ACL entries for the users steve and jason.



Modify an existing ACL by adding a new entry for the group eng:

$ hdfs dfs -setfacl -m group:eng:rw- fileA



Remove the specific ACL entry for the user jason:

$ hdfs dfs -setfacl -x user:jason fileA



Remove all ACL entries, leaving only the base owner, group, and other permissions:

$ hdfs dfs -setfacl -b fileA

The ACL entries for the user steve and the group eng are removed.



Using setfacl on Directories

The setfacl command can also be used on directories, with a few differences. For example, the -R recursive option can be used to set, modify, or remove permissions and ACL entries across an entire directory hierarchy:

hdfs dfs -setfacl -R … dir1

Directories also support default ACL entries, which are inherited by child files and directories created after the default entries are set. Add a default entry for the user jason:

hdfs dfs -setfacl -m default:user:jason:rw- dir1

Any directory with default ACL entries must include default entries for the owner, group, and other user classes.



Explicitly set a default mask:

hdfs dfs -setfacl -m default:mask::r-- dir1

A default mask on a directory limits the permissions and ACL entries inherited by child files and directories. When a mask is not set explicitly, it is calculated as the union of the permissions of the entries it governs: the unnamed group plus any named users or named groups listed in the ACL.
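The resulting ACL can be inspected with getfacl:

$ hdfs dfs -getfacl dir1

The output lists the owner, the group, each ACL entry, and the mask; entries restricted by the mask are annotated with their #effective: permissions.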



The other type of mask is the access mask on a file or directory.

Explicitly set an access mask:

hdfs dfs -setfacl -m mask::r-- fileA

The purpose of an access mask is to provide a mechanism to quickly limit or restore the effective permissions for multiple users and groups using a single command. An access mask affects any named users, named groups, and the unnamed group.
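For example, reusing fileA from above, one command can lock out every governed entry and another can restore them:

$ hdfs dfs -setfacl -m mask::--- fileA

$ hdfs dfs -setfacl -m mask::rw- fileA

The first leaves each named user, named group, and the unnamed group with no effective permissions; the second restores each entry up to its own granted permissions.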



Report basic filesystem statistics and the state of each DataNode:

hdfs dfsadmin -report



Display disk usage in human-readable units:

hdfs dfs -du -h



Check filesystem health, displaying files, blocks, block locations, and racks:

hdfs fsck <path> -files -blocks -locations -racks



List files currently open for writing:

hdfs fsck <path> -openforwrite



Move corrupted files to /lost+found:

hdfs fsck <path> -move



Delete corrupted files:

hdfs fsck <path> -delete



For example, a full report on /user/root:

hdfs fsck /user/root -files -blocks -locations -racks



Display all dfsadmin options:

hdfs dfsadmin -help



To transition a NameNode into safemode:

hdfs dfsadmin -safemode enter



To force a NameNode checkpoint operation that creates both a new fsimage and edits file (the NameNode must be in safemode first):

hdfs dfsadmin -saveNamespace



To create only a new edits file:

hdfs dfsadmin -rollEdits



To exit NameNode safemode:

hdfs dfsadmin -safemode leave
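Putting those together, a typical manual checkpoint sequence is:

hdfs dfsadmin -safemode enter

hdfs dfsadmin -saveNamespace

hdfs dfsadmin -safemode leave

The current safemode status can be checked at any time with hdfs dfsadmin -safemode get.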



To download the latest fsimage file (useful for doing remote backups):

hdfs dfsadmin -fetchImage






Configuring Quotas

You must be an HDFS superuser to administer quotas.

Setting a name quota on one or more directories:

hdfs dfsadmin -setQuota <n> <directory> [<directory>] …



Issue the command again to modify a name quota.

Removing a name quota on one or more directories:

hdfs dfsadmin -clrQuota <directory> [<directory>] …



Setting a space quota on one or more directories:

hdfs dfsadmin -setSpaceQuota <n> <directory> [<directory>] …

Issue the command again to modify a space quota.



Removing a space quota on one or more directories:

hdfs dfsadmin -clrSpaceQuota <directory> [<directory>] …

An attempt to set a name or space quota will still succeed even if the directory would be in immediate violation of the new quota.



Any user may view current quota information using the HDFS shell count command:

hdfs dfs -count -v -q <directory_name>
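As a quick illustrative sketch (the quota values and path are hypothetical):

hdfs dfsadmin -setQuota 1000 /user/steve

hdfs dfsadmin -setSpaceQuota 10g /user/steve

hdfs dfs -count -v -q /user/steve

With -v, the count output includes a header row covering QUOTA, REM_QUOTA, SPACE_QUOTA, REM_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME.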



Rebalance block storage across DataNodes (the default threshold is 10 percent):

hdfs balancer



Changing the threshold to 5 percent:

hdfs balancer -threshold 5



Display other options:

hdfs balancer -help



After updating the include/exclude host files to commission or decommission nodes, refresh the node lists with hdfs dfsadmin -refreshNodes and yarn rmadmin -refreshNodes.
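A sketch of a DataNode decommission, assuming dfs.hosts.exclude points to /etc/hadoop/conf/dfs.exclude (that path and the hostname are assumptions; use whatever your configuration specifies):

echo "node4.example.com" >> /etc/hadoop/conf/dfs.exclude

hdfs dfsadmin -refreshNodes

The node is then reported as Decommission In Progress by hdfs dfsadmin -report until its blocks have been re-replicated.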



The hdfs fsck <path> -racks report displays the number of racks of which the NameNode is aware.



Display whether a NameNode is active or standby:

hdfs haadmin -getServiceState <serviceId>



Manually initiate a NameNode failover:

hdfs haadmin -failover <from_serviceId> <to_serviceId>



Enable snapshots on a directory:

hdfs dfsadmin -allowSnapshot <directory_path>



Disable snapshots on a directory:

hdfs dfsadmin -disallowSnapshot <directory_path>



The hdfs lsSnapshottableDir command lists the snapshottable directories.



Rename a snapshot:

hdfs dfs -renameSnapshot <directory_path> <old_name> <new_name>



Create a snapshot:

hdfs dfs -createSnapshot <directory_path> [<snapshot_name>]



Delete a snapshot:

hdfs dfs -deleteSnapshot <directory_path> <snapshot_name>
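A minimal end-to-end sketch (the directory and snapshot name are illustrative):

hdfs dfsadmin -allowSnapshot /data/weblogs

hdfs dfs -createSnapshot /data/weblogs snap1

hdfs dfs -ls /data/weblogs/.snapshot/snap1

hdfs dfs -deleteSnapshot /data/weblogs snap1

Snapshot contents are read-only and are reached through the hidden .snapshot path inside the snapshottable directory.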



Display DistCp options:

hadoop distcp -help

Copy data between clusters:

hadoop distcp hdfs://<namenode1>:8020/<source> hdfs://<namenode2>:8020/<destination>



Copy a list of sources, read from the file <source_list>, to a destination:

hadoop distcp -f hdfs://<namenode1>:8020/<source_list> hdfs://<namenode2>:8020/<destination>



The DistCp -update option copies only the source files that are missing from the target or that differ from the target copy, while -overwrite overwrites target files unconditionally. Both options change the path behavior: the contents of the source directories, rather than the source directories themselves, are copied to the target.



The distcp command includes a -m <n> option that caps the number of simultaneous map tasks used for the copy.



DistCp divides work between mappers using one of two strategies: static (uniformsize) or dynamic. In static mode, each mapper is assigned a fixed set of files up front; mappers that finish early are not assigned any more work, while slower mappers must still process all of the files assigned to them. This is the default strategy, although it can be specified explicitly by adding the -strategy uniformsize option. In dynamic mode, the file list is split into chunks that mappers claim as they finish, so faster mappers end up processing more of the files.



hadoop distcp -m 20 -strategy uniformsize <source> <destination>



hadoop distcp -m 20 -strategy dynamic <source> <destination>



-async                Should distcp execution be blocking
-atomic               Commit all changes or none
-bandwidth <arg>      Specify bandwidth per map in MB
-delete               Delete from target, files missing in source
-f <arg>              List of files that need to be copied
-filelimit <arg>      (Deprecated!) Limit number of files copied to <= n
-i                    Ignore failures during copy
-log <arg>            Folder on DFS where distcp execution logs are saved
-m <arg>              Max number of concurrent maps to use for copy
-mapredSslConf <arg>  Configuration for ssl config file, to use with hftps://
-overwrite            Choose to overwrite target files unconditionally, even if they exist
-p <arg>              Preserve status (rbugp: replication, block-size, user, group, permission)
-sizelimit <arg>      (Deprecated!) Limit total size of files copied to <= n bytes
-skipcrccheck         Whether to skip CRC checks between source and target paths
-strategy <arg>       Copy strategy to use. Default is dividing work based on file sizes
-tmp <arg>            Intermediate work path to be used for atomic commit
-update               Update target, copying only missing files or directories

NameNode HA Setup



A six-node cluster is used for this example:

NODE 1: HDFS master component, ZooKeeper Server, Ambari Agent, JournalNode, ResourceManager, App Timeline Server, History Server, HiveServer2

NODE 2: HDFS master component, ZooKeeper Server, Ambari Agent, JournalNode

NODE 3: Ambari Server, ZooKeeper Server, JournalNode, Clients, Hive Metastore, WebHCat Server, HiveServer2, Metrics Collector

NODE 4: Ambari Agent, HDFS worker component, NodeManager, Hive Client, Pig

NODE 5: Ambari Agent, HDFS worker component, NodeManager, Hive Client, Pig

NODE 6: Ambari Agent, HDFS worker component, NodeManager, Hive Client, Pig


1)     If necessary, use Ambari Web UI > Services > ZooKeeper > Service Actions > Add ZooKeeper Server to add more ZooKeeper servers (a minimum of three is required for NameNode HA).
2)     In Ambari, click Services > HDFS > Service Actions > Enable NameNode HA. This opens a configuration wizard.
3)     In the Getting Started window, type the Nameservice ID, the logical name of the HDFS cluster. The wizard then walks through GUI screens where the following properties are entered:

-- Logical Name (dfs.nameservices)

-- fs.defaultFS (in core-site.xml, the default path prefix used by the Hadoop FS client when none is given)

-- Installation: the current NameNode on NODE 1 and the additional NameNode on NODE 2

-- JournalNodes: one on the current NameNode (NODE 1), a second on the additional NameNode (NODE 2), and a third on NODE 3

-- On each JournalNode host, set the directory where the edit logs are stored in hdfs-site.xml:
dfs.journalnode.edits.dir = "/path/to/edits/info/data"

-- The JournalNode quorum is located by a property in hdfs-site.xml:
dfs.namenode.shared.edits.dir = "qjournal://jn1:8485;jn2:8485;jn3:8485"

-- dfs.nameservices = "haclustersetup" (the logical HDFS cluster name that points to the two NameNodes)

-- dfs.ha.namenodes.haclustersetup = "nn1,nn2" (the IDs of the NameNodes)

-- dfs.namenode.http-address.<logical_cluster_name>.<namenode_id>

Ex: dfs.namenode.http-address.haclustersetup.nn1 = "node1:50070"
dfs.namenode.http-address.haclustersetup.nn2 = "node2:50070"


-- dfs.namenode.rpc-address.<logical_cluster_name>.<namenode_id>

Ex: dfs.namenode.rpc-address.haclustersetup.nn1 = "node1:8020"
dfs.namenode.rpc-address.haclustersetup.nn2 = "node2:8020"


-- dfs.ha.fencing.methods (values: shell or sshfence)

-- dfs.client.failover.proxy.provider.haclustersetup determines the Java class used by the client to find the currently Active NameNode: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
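Once HA is enabled, clients address the cluster by its nameservice ID instead of a single NameNode host, for example (using the names defined above):

hdfs dfs -ls hdfs://haclustersetup/user/steve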

4)     Manually create a checkpoint:
--sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'
--sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'

5)     Manually initialize the JournalNodes:
--sudo su hdfs -l -c 'hdfs namenode -initializeSharedEdits'

6)     Manually initialize the metadata for NameNode automatic failover:
--sudo su hdfs -l -c 'hdfs zkfc -formatZK'

7)     Manually initialize the metadata for the additional NameNode:
--sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'

8)     hdfs haadmin -getServiceState <serviceId> (to get the service state)

9)     hdfs haadmin -failover <from_serviceId> <to_serviceId> (to manually initiate a failover)
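For example, with the NameNode IDs defined above (nn1 and nn2):

hdfs haadmin -getServiceState nn1

hdfs haadmin -failover nn1 nn2

The failover command transitions nn1 to standby and nn2 to active, fencing the first NameNode if necessary.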