Wednesday, April 25, 2012

Python split file based on json tag


'''A script to split places_dump_US.geojson into small files based on the json tag, province'''
'''script start'''


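import json

def writeState(m, state='none'):
    '''Append record m as one JSON line to the output file for the given state.'''
    out_name = 'places_dump_US.' + state + '.geojson'
    # the with-block closes the file even if the write fails
    with open(out_name, 'a') as tg:
        print(json.dumps(m), file=tg)

count = 0
with open('places_dump_US.geojson', 'r') as sf:
    for line in sf:
        try:
            p = json.loads(line)
        except ValueError:
            # not valid JSON: keep the raw text so nothing is lost
            p = line.rstrip('\n')
        try:
            cstate = p['properties']['province'] or 'none'
        except (KeyError, TypeError):
            # no properties/province tag: fall back to the 'none' file
            cstate = 'none'
        writeState(p, str(cstate))
        print(count)
        count += 1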


'''script end'''
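
The script above reopens an output file for every input line, which can be slow on a large dump. A minimal sketch of one alternative, assuming the number of distinct provinces is small enough to keep one file handle open per province. The function name split_by_province and its parameters are made up for illustration, and it writes each input line back unchanged instead of re-serializing it.

```python
import json
import os


def split_by_province(src_path, out_dir, prefix="places_dump_US"):
    """Split a line-delimited GeoJSON file into one file per province.

    Returns a dict mapping each province to the number of lines written.
    """
    handles = {}  # province -> open output file
    counts = {}   # province -> lines written so far
    try:
        with open(src_path, "r") as src:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                try:
                    props = json.loads(line).get("properties") or {}
                    state = str(props.get("province") or "none")
                except (ValueError, AttributeError):
                    # unparseable line, or JSON that is not an object
                    state = "none"
                if state not in handles:
                    name = "%s.%s.geojson" % (prefix, state)
                    handles[state] = open(os.path.join(out_dir, name), "a")
                handles[state].write(line + "\n")
                counts[state] = counts.get(state, 0) + 1
    finally:
        # close every output file, even if reading failed part-way
        for fh in handles.values():
            fh.close()
    return counts
```

Calling split_by_province('places_dump_US.geojson', '.') would produce the same file names as the script above; the returned counts give a quick check of how many places fell into each province.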

Monday, April 23, 2012

MongoDB cheatsheet

Import Data
//import SourceFile (a csv file) into collection CollectionName of database DataBaseName, using the first line as the header
mongoimport -d DataBaseName -c CollectionName --type csv --file SourceFile --headerline

//import xxx.txt into collection bar of database foo; the file has no header line, so the field names must be given with -f (use --type tsv for a tab-separated file)
mongoimport -d foo -c bar --type csv --file xxx.txt -f id,timestamp,latitude,longitude,place

mongoimport -d foo -c bar --type tsv --file xxx.txt -f id,timestamp,latitude,longitude,place
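
When many files have to be imported, the command line can be assembled in Python. A small sketch that uses only the flags shown above; the helper names mongoimport_args and run_import are made up for illustration, and run_import assumes mongoimport is on the PATH.

```python
import subprocess


def mongoimport_args(db, collection, path, file_type="csv", fields=None):
    """Build the mongoimport argument list for one source file."""
    args = ["mongoimport", "-d", db, "-c", collection,
            "--type", file_type, "--file", path]
    if fields:
        # no header line in the file: name the fields explicitly
        args += ["-f", ",".join(fields)]
    else:
        # take the field names from the first line of the file
        args.append("--headerline")
    return args


def run_import(db, collection, path, file_type="csv", fields=None):
    """Run mongoimport and return its exit code."""
    return subprocess.call(
        mongoimport_args(db, collection, path, file_type, fields))
```

For example, mongoimport_args('foo', 'bar', 'xxx.txt', 'tsv', ['id', 'timestamp', 'latitude', 'longitude', 'place']) yields the same arguments as the tsv command above.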

Create Index

db.[collection-name].ensureIndex({[tag-name]: 1})

Create Spatial Index

db.[collection-name].ensureIndex({[tag-name]:"2d"})
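
The same two indexes can be created from Python. A sketch assuming the PyMongo driver, whose Collection objects provide create_index; the function name build_indexes and the field names in the usage comment are placeholders. Note that a "2d" index expects the indexed field to hold a coordinate pair, not separate latitude and longitude fields.

```python
def build_indexes(collection, tag_name, location_field):
    """Create an ascending index and a 2d spatial index.

    collection is expected to be a pymongo Collection; any object with a
    compatible create_index method works as well.
    """
    # ascending index, like ensureIndex({tag: 1}) in the shell
    plain = collection.create_index([(tag_name, 1)])
    # spatial index, like ensureIndex({tag: "2d"}) in the shell
    spatial = collection.create_index([(location_field, "2d")])
    return plain, spatial


# Usage, assuming pymongo is installed and a mongod is running locally:
#   from pymongo import MongoClient
#   places = MongoClient()["foo"]["bar"]
#   build_indexes(places, "id", "loc")   # "loc" is a hypothetical field
```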




Python SimpleHTTPServer

Run one of the following in the directory you want to share.

python -m SimpleHTTPServer 9999

or

python -m http.server 8001

//the first command (Python 2) starts a simple http server on local port 9999 using the SimpleHTTPServer module; the second (Python 3) does the same on port 8001 using the http.server module.

Other users can then fetch the files in that directory with a browser.
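
The server can also be started from a script, which avoids changing into the shared directory first. A sketch for Python 3.7 or later using the standard http.server module; the function name serve_directory and its defaults are assumptions for illustration.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8001, host=""):
    """Serve the files in directory over HTTP from a background thread.

    Returns the server object; call shutdown() on it to stop serving.
    """
    # bind the handler to the shared directory instead of the current one
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Because the thread is a daemon, the server stops when the main program exits; a long-running share needs the main program to stay alive (for example by waiting on input).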


Monday, April 16, 2012

Spatial data mining procedures

Procedure
1. import json files
2. build index
3. build foreign keys
4. build tiles
5. build neighborhood index (original oid, target oid, distance, direction, topology)
6. spatial data mining tools
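
Step 5 can be sketched in plain Python for points given as latitude/longitude. This is only an illustration under stated assumptions: distance is the haversine great-circle distance on a sphere of radius 6371 km, direction is the initial compass bearing in degrees, topology is left out, and every pair is compared by brute force (the tiles from step 4 would be the natural way to cut that down). All names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def distance_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2 (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(y, x)) + 360) % 360


def neighborhood_rows(places, max_km):
    """Return (original oid, target oid, distance, direction) rows.

    places is a list of (oid, lat, lon) tuples; only pairs no farther
    apart than max_km are kept.
    """
    rows = []
    for oid_a, lat_a, lon_a in places:
        for oid_b, lat_b, lon_b in places:
            if oid_a == oid_b:
                continue
            d = distance_km(lat_a, lon_a, lat_b, lon_b)
            if d <= max_km:
                rows.append((oid_a, oid_b, d,
                             bearing_deg(lat_a, lon_a, lat_b, lon_b)))
    return rows
```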

Platforms
1. PostgreSQL + postgis (open source solution)
2. MongoDB
3. MS-SQL
4. Hadoop

Input
1. Place_dump_US
2. Checkin

Output
1. OpenLayers + Geoserver (visualization)
2. Patterns (representation)





Sunday, April 15, 2012

Hadoop Clusters Setup

Pre-setup
1. install JDK/JRE 1.6 or later
2. install ssh
     a. create the master ssh key (run in ~/.ssh on the master)
         ssh-keygen -t dsa -P ""
         cat id_dsa.pub>>authorized_keys
     b. copy the master public key to each slave
         scp id_dsa.pub slaveN:~/.ssh/master.pub
     c. on each slave, add the master public key to authorized_keys
         cat master.pub>>authorized_keys
     d. from the master, ssh to slaveN and check that no password or passphrase is requested.
3. edit /etc/hosts & /etc/hostname

Setup
1. set up conf/hadoop-env.sh (export JAVA_HOME)

2. core-site.xml (name node) --for master & slaves

       name:  fs.default.name
       value: hdfs://master

3. hdfs-site.xml (name and data directories) --for master & slaves

       name:  dfs.name.dir
       value: /home/hduser/hddata/name

       name:  dfs.data.dir
       value: /home/hduser/hddata/data

4. mapred-site.xml (jobtracker) --for master & slaves

       name:  mapred.job.tracker
       value: master:54311
           
5. list all slaves in conf/slaves --for master/jobtracker only

6. run chmod g-w on all data and name directories

** start-dfs.sh reads conf/slaves on the name node and starts a data node on each listed slave.
** start-mapred.sh reads conf/slaves on the job tracker node and starts a task tracker on each listed slave.
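
Since the same three site files go to the master and every slave, it can help to generate them from one place. A sketch that renders the property values listed in the Setup steps into Hadoop's configuration/property/name/value XML layout; the helper name site_xml and the SITE_FILES table are made up for illustration, and the values are copied from the notes above.

```python
import xml.etree.ElementTree as ET

# property values taken from the Setup steps above
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://master"},
    "hdfs-site.xml": {
        "dfs.name.dir": "/home/hduser/hddata/name",
        "dfs.data.dir": "/home/hduser/hddata/data",
    },
    "mapred-site.xml": {"mapred.job.tracker": "master:54311"},
}


def site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # print each file's content; writing into conf/ is left to the reader
    for filename, props in SITE_FILES.items():
        print(filename)
        print(site_xml(props))
```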

Startup
 1. execute "hadoop namenode -format" on name node site

 2. execute "start-dfs.sh" on name node site

 3. execute "start-mapred.sh" on job tracker site

Shutdown
 1. execute "stop-mapred.sh" on job tracker site

 2. execute "stop-dfs.sh" on name node site