
Shantanu Oak

Feb. 25th, 2017

02:47 pm - Tesseract command and API

Tesseract is available as a command-line tool and through an HTTP API.

wget https://s3.amazonaws.com/datameetgeo/CVC1.jpg

alias tesseract='docker run --rm -v `pwd`:/work -w /work vimagick/tesseract'

# output goes to read_scan.txt (tesseract appends .txt to the output base)
tesseract CVC1.jpg read_scan


# Install docker
yum install -y docker git mysql

# Install docker-compose
curl -L https://github.com/docker/compose/releases/download/1.8.0/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

cd /tmp/
git clone https://github.com/tleyden/open-ocr.git
cd open-ocr/docker-compose/

/usr/local/bin/docker-compose up

# test from localhost:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"https://s3.amazonaws.com/datameetgeo/CVC1.jpg","engine":"tesseract"}' http://`hostname -i`:9292/ocr

# or from some other website using API
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"https://s3.amazonaws.com/datameetgeo/CVC1.jpg","engine":"tesseract"}' http://<server-public-ip>:9292/ocr > study.txt

# Use OCR API for local file processing

cd /tmp/open-ocr/docs
./upload-local-file.sh CVC1.jpg image/jpeg
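The same /ocr endpoint can also be called from Python. A minimal sketch, assuming the open-ocr stack above is listening on port 9292; the helper below only builds the JSON payload, and the commented-out `requests.post` call would send it:

```python
import json

# hypothetical helper: build the JSON body the open-ocr /ocr endpoint expects
def ocr_payload(img_url, engine="tesseract"):
    return json.dumps({"img_url": img_url, "engine": engine})

payload = ocr_payload("https://s3.amazonaws.com/datameetgeo/CVC1.jpg")

# to actually send it (assumes the docker-compose stack above is up):
# import requests
# text = requests.post("http://localhost:9292/ocr", data=payload,
#                      headers={"Content-Type": "application/json"}).text
```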

Current Mood: learning

Feb. 24th, 2017

08:02 pm - Tips to save Amazon cost

1) Select either the Amazon Linux or Ubuntu AMI. Red Hat and Windows AMIs are very costly.
2) Always use spot EC2 instances.
3) Use the Magnetic volume type instead of SSD.
4) Add tags so you can get a tag-wise break-up of the bill, e.g. type: testing
5) Create an image and terminate the running instance. Even stopped instances are charged (for EBS storage).

For S3 storage:

1) Always use compressed files.
2) Transfer objects to Glacier after a few days.

In general:

1) Use a single region, e.g. us-east-1.
2) Create a billing alarm in CloudWatch. This will send you an email alert whenever the bill goes above the given limit.
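The billing alarm can also be scripted. A sketch using boto3's `put_metric_alarm`; the alarm name, threshold and SNS topic ARN are placeholders, and it assumes "Receive Billing Alerts" is already enabled for the account (billing metrics live only in us-east-1):

```python
# Build the parameters for a CloudWatch billing alarm.
def billing_alarm_params(threshold_usd, sns_topic_arn):
    return dict(
        AlarmName='monthly-bill-above-%d-usd' % threshold_usd,
        Namespace='AWS/Billing',
        MetricName='EstimatedCharges',
        Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
        Statistic='Maximum',
        Period=21600,            # the billing metric updates a few times a day
        EvaluationPeriods=1,
        Threshold=float(threshold_usd),
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=[sns_topic_arn],
    )

# to create the alarm (placeholder ARN):
# import boto3
# boto3.client('cloudwatch', region_name='us-east-1').put_metric_alarm(
#     **billing_alarm_params(50, 'arn:aws:sns:us-east-1:<account>:billing'))
```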

Current Mood: frugal

Feb. 17th, 2017

01:54 pm - Shell script to generate mysql report

Here is a shell script that will create a report of row counts per table.

IPADD=`hostname -i`
for PORT in {3307..3321}
do
    for i in `mysql -h$IPADD -uroot -pindia$PORT -P $PORT -Bse"select DISTINCT(TABLE_SCHEMA) from information_schema.tables where TABLE_SCHEMA NOT IN ('information_schema', 'mysql', 'performance_schema', 'awsdms_control')"`
    do
        mysqlshow --count -h$IPADD -uroot -pindia$PORT -P $PORT $i >> count1_stn$PORT.txt 2>> count1_err$PORT.txt &
    done
done


## shell script to get the counts across all mysql databases, faster (but less accurate) than the method above

for PORT in {3307..3309}
do
    mysql -h172.31.29.120 -uroot -pindia$PORT -P $PORT -Bse" select '$PORT' as port, TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS from information_schema.tables where TABLE_SCHEMA not in ( 'performance_schema' , 'mysql', 'information_schema')" >> count_stn.txt 2>> count_err.txt
done

This cron job is recommended:

* * * * * mysqladmin status -h172.31.29.120 -uroot -pindia3308 -P 3308 >> /root/success.txt 2>> /root/err.txt

# dump each table of the test database to its own file

for i in `mysql test -Bse"show tables"`
do
    mysqldump test --no-data $i > $i.sql
    gzip $i.sql
done


# start all docker containers
for f in `docker ps -aq`; do docker start $f; done

## remove primary key of pune and add region names for all

cat alter.sh
IPADD=`hostname -i`

time mysql -h$IPADD -uroot -pindia3310 -P 3310 -Bse"select concat('alter table ', TABLE_SCHEMA,'.',TABLE_NAME , ' DROP primary key, add column region varchar(100) default \"pune\";') from information_schema.tables where TABLE_SCHEMA NOT IN ('information_schema', 'mysql', 'performance_schema')" | mysql -h$IPADD -uroot -pindia3310 -P 3310 --force > pr.stn 2> pr.err &

while read -r REGION PORT
do
    time mysql -h$IPADD -uroot -pindia$PORT -P $PORT -Bse"select concat('alter table ', TABLE_SCHEMA,'.',TABLE_NAME , ' add column region varchar(100) default \"$REGION\";') from information_schema.tables where TABLE_SCHEMA NOT IN ('information_schema', 'mysql', 'performance_schema')" | mysql -h$IPADD -uroot -pindia$PORT -P $PORT --force > stn_$PORT.txt 2> err_$PORT.txt &
done << heredoc
aurangabad 3315
konkan 3314
latur 3313
nagpur 3312
nashik 3311
heredoc
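The SQL-generation trick used here (a SELECT that builds ALTER statements, piped back into mysql) can be mirrored in Python for a quick sanity check before running it against live servers. A sketch; the schema/table names and region are made-up examples:

```python
def region_alters(tables, region):
    """Build one ALTER per (schema, table), like the select concat(...) above."""
    return [
        'alter table %s.%s add column region varchar(100) default "%s";'
        % (schema, table, region)
        for schema, table in tables
    ]

# example input: two tables in a hypothetical 'sales' schema
stmts = region_alters([('sales', 'orders'), ('sales', 'invoices')], 'latur')
```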

## merge all regions on pune

cat merge.sh
IPADD=`hostname -i`
for PORT in {3311..3316}
do
    time mysqldump --all-databases -h$IPADD -uroot -pindia$PORT -P $PORT --no-create-db --no-create-info --complete-insert | mysql -h$IPADD -uroot -pindia3310 -P 3310 --force > portsuccess$PORT.txt 2>> porterror$PORT.txt &
done

## sample of 100 records per table

# cat sample.sh
cd /tmp/
for PORT in {3307..3321}
do
    mysqldump -h172.31.24.21 -uroot -pindia$PORT -P $PORT --all-databases --routines --where=" 1 = 1 limit 100" > port$PORT.sql 2> porterror$PORT.txt
    gzip port$PORT.sql
done

# copy all dump files to S3 sub-folder sample
# aws s3 cp /tmp/ s3://datameetgeo/sample/ --recursive --include "*.sql.gz"

## restore sample data
cat restore_sample.sh
# aws s3 sync s3://datameetgeo/sample .
# gunzip *.sql.gz

IPADD=`hostname -i`

for PORT in {3307..3321}
do
    mysql -h$IPADD -uroot -pindia$PORT -P $PORT < port$PORT.sql > port_success$PORT.txt 2> port_err$PORT.txt
done


## show processlist across all ports
for PORT in {3307..3321}
do
    mysql -h`hostname -i` -uroot -pindia$PORT -P $PORT -e"show processlist"
done

Current Mood: developer, dba

Feb. 11th, 2017

07:50 pm - sql server on ec2 instance

Here are the 5 steps to check that SQL Server is running properly on an EC2 instance.

1) Create a database called "test".

2) Create a user:
create login dba1 with password ='India162)';

use test
create user dba1 for Login dba1

use test
exec sp_addrolemember 'db_owner', dba1

3) In the server Properties, select the Security tab and choose "SQL Server and Windows Authentication mode".

4) Restart SQL Server.

5) Check that the instance is listening on the expected port.

Current Mood: astonished

Feb. 6th, 2017

09:16 am - Cost of Windows server on AWS

I was looking for a Windows server for a project. I came across this...

Windows server 2008 R2 with SQL server Express and IIS ami-3334d425

This version is available as a default image that costs around 20% more than Linux. But I was looking for the Enterprise version.

Windows Server 2008 R2 with SQL Server 2012 Enterprise on Windows 2008 R2 6.1.7601

This image is available in the AWS Marketplace. The cost is 300% more than Linux.

A standard Linux server that is available for $0.8 per hour will cost $2.5 for this Enterprise version of Windows. We can use spot instances to bring the Linux cost down to $0.2 or $0.3, but this option is not available for the Enterprise Windows server.

Low-end servers cannot be used with this Windows version because it requires at least 16 GB RAM. Therefore the cheapest configuration available is m3.xlarge at $2 per hour (a monthly cost of around 1 lakh rupees). We can enter into a long-term contract to get discounts with Linux, but this is not possible with Windows server; effectively it is charged 3 to 5 times more for a one-year contract.
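The monthly figure above is easy to verify with back-of-the-envelope arithmetic; the exchange rate is my assumption (roughly 68 INR per USD in early 2017):

```python
hourly_usd = 2.0                  # m3.xlarge with enterprise Windows, from above
monthly_usd = hourly_usd * 24 * 30
monthly_inr = monthly_usd * 68    # assumed exchange rate, early 2017
# monthly_usd comes to 1440, so monthly_inr lands close to 1 lakh (100,000)
```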

Current Mood: shocked

Jan. 4th, 2017

03:51 pm - pandas example dataframe to be used frequently

txt = """
personId date_Recieved
1 2 feb 2016
1 4 feb 2016
1 6 feb 2016
2 10 dec 2016
2 1 jan 2017
2 20 jan 2017

df = pd.read_csv(StringIO(txt), sep='\s{2,}', engine='python', parse_dates=[1])
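One reason this frame is handy: it exercises groupby over dates. For example, the gap between consecutive receipts per person (a sketch; the frame is rebuilt inline so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'personId': [1, 1, 1, 2, 2, 2],
    'date_Recieved': pd.to_datetime(
        ['2 feb 2016', '4 feb 2016', '6 feb 2016',
         '10 dec 2016', '1 jan 2017', '20 jan 2017']),
})

# gap between consecutive receipts, per person; the first row of each group is NaT
gaps = df.groupby('personId')['date_Recieved'].diff()
```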

Current Mood: learning

Dec. 24th, 2016

01:12 pm - Processing ipython notebook output

Here is a shell script that will convert all IPython notebooks under the /home/ directory to Python scripts.

# cat final.sh
for i in `find /home -name "*.ipynb"`
do
    echo $i
    # nbconvert writes the script next to the notebook; show it
    jupyter nbconvert $i --to script && cat ${i%.ipynb}.py
done

If using docker, we can use the following commands to execute the script.

docker cp final.sh f73:/home/

time docker exec -i f73 bash /home/final.sh > stn1.txt 2> err1.txt

Once the output is saved to a single file, we can use grep to select all the commands executed by a user.

grep -A 30 'shantanuo/basic' stn1.txt


Here is the Python code that will show all the directories and their contents.

import os
import pandas as pd

mypath = '/home'  # one sub-directory per user
myd = {}
for name in os.listdir(mypath):
    subdir = os.path.join(mypath, name)
    if not os.path.isdir(subdir):
        continue
    myd[name] = ', '.join(i for i in os.listdir(subdir) if i != '.ipynb_checkpoints')

df = pd.DataFrame(myd, index=[0]).T

pd.set_option('max_colwidth', 800)

And now you can check what the students have entered in their notebooks...

!jupyter nbconvert --to script /home/lucifer/Basics.ipynb

!cat /home/lucifer/Basics.py | grep -v '^#' | grep -v '^$'

Current Mood: teaching

Dec. 7th, 2016

07:38 pm - docker for AWS

1) Once you have logged in to your AWS account, click on this link...


2) Click Next. In the drop-down "Which SSH key to use?" select your key, e.g. dec15a.pem, and click Next. You can choose 1 manager instead of the default 3, and 1 worker node instead of the default 5. You can scale up later if required.

3) Click on Next.

4) Select the check box that says "I acknowledge that AWS CloudFormation might create IAM resources." and then click on "Create".

5) Go to your EC2 instance list, look for the server tagged "Docker-Manager" and note its public IP address. Now connect to that server using...

ssh -i dec15a.pem docker@

6) Login to your docker-hub account using "docker login" command.

7) Run the following command to start the CRM application.

docker service create --with-registry-auth --name mydemo -p 88:80 shantanuo/mylamp

8) Your application should be accessible on port 88 of the load-balanced IP, e.g.


Here are other useful commands to understand how docker swarm mode works.

# create stacks
docker service create --name demo -p 80:80 shantanuo/lamp
docker service create --name demo1 -p 81:80 shantanuo/pyrun

# list nodes and services
docker node ls
docker service ls
docker service ps demo

# scale the service to more replicas
docker service scale demo=5

# update an image
docker service update --image shantanuo/lamp:latest demo

On the DB server, e.g.

docker run -p 3306:3306 -v /dbdata:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=india -d mysql

On application server:

docker run --name mywp -p 8082:80 -e WORDPRESS_DB_HOST='' -e WORDPRESS_DB_NAME='myworld' -e WORDPRESS_DB_USER='root' -e WORDPRESS_DB_PASSWORD='india' -d wordpress

Using Docker swarm:

docker service create --name mywpswarm -p 8082:80 -e WORDPRESS_DB_HOST='' -e WORDPRESS_DB_NAME='myworld' -e WORDPRESS_DB_USER='root' -e WORDPRESS_DB_PASSWORD='india' wordpress

Current Mood: hopeful

Dec. 3rd, 2016

06:44 pm - Slice a dataframe

Slicing a DataFrame against a labelled index is done using DataFrame.loc[]. Try the following examples and see what is returned:

Select information from page 17:
tl.loc[(17,),]

Select ‘body’ section of page 17:
tl.loc[(17, 'body'),]

Select counts of the word ‘Anne’ in the ‘body’ section of page 17:
tl.loc[(17, 'body', 'Anne'),]

The levels of the index are specified in order, so in this case the first value refers to ‘page’, then ‘section’, and so on. To skip specifying anything for an index level – that is, to select everything for that level – slice(None) can be used as a placeholder:

Select counts of the word ‘Anne’ for all pages and all page sections
tl.loc[(slice(None), slice(None), "Anne"),]

Finally, it is possible to select multiple labels for a level of the index, with a list of labels (i.e. ['label1', 'label2']) or a sequence covering everything from one value to another (i.e. slice(start, end)):

Select pages 37, 38, and 52
tl.loc[([37, 38, 52]),]

Select all pages from 37 to 40
tl.loc[(slice(37, 40)),]

Select counts for ‘Anne’ or ‘Hilary’ from all pages
tl.loc[(slice(None), slice(None), ["Anne", "Hilary"]),]

The reason for the comma in tl.loc[(...),] is that columns can be selected in the same way after the comma. Pandas DataFrames can have a multiple-level index for columns as well.
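The slicers above can be tried on a small stand-in for `tl`. A sketch with a made-up (page, section, token) index; the counts are arbitrary:

```python
import pandas as pd

# tiny stand-in for the token-list frame: a 3-level (page, section, token) index
idx = pd.MultiIndex.from_tuples(
    [(17, 'body', 'Anne'), (17, 'body', 'Hilary'), (18, 'header', 'Anne')],
    names=['page', 'section', 'token'])
tl = pd.DataFrame({'count': [3, 1, 2]}, index=idx).sort_index()

# all rows for page 17
page17 = tl.loc[(17,), ]

# counts of 'Anne' across all pages and sections
anne = tl.loc[(slice(None), slice(None), 'Anne'), ]
```

Note the sort_index() call: label slicing on a MultiIndex requires a lexsorted index.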


Current Mood: learning

Nov. 30th, 2016

07:05 pm - Dark Patterns everywhere!

There is a very good article about Dark Patterns that can be found here...


Unlike its predecessors, Windows 10 doesn't offer a way to turn off automatic Windows updates. It consumes all my bandwidth, and I have to pay the internet service provider even if I have just kept the PC turned on (without doing anything!). Newer versions of Windows make it almost impossible to change the default browser, and even if you do, it will switch back to the Microsoft browser on the next update. No one knows when that next update is due or how much data it is going to fetch. Once the update is installed, it will automatically restart your PC and you will lose any running processes.

Windows is not just another operating system. Once you start using it, it becomes your way of life. It dictates to you and limits your creativity. But this is expected from a commercially motivated company with global ambitions. What about MySQL, which has open source roots? They are hiding the download link to get more registrations. Great! The problem is that the "log-in" and "register" links do not work as expected. If you have a very old MySQL account, it may or may not be integrated nicely with your Oracle account.

Google has discontinued some very useful tools like its feed reader and Google Code, which hosted a massive number of open source repositories. But I still appreciate Google, because I was given several months to move my data elsewhere. Other companies may not be that kind. For example, when I did not renew a domain name registered with Yahoo, they simply refused to release the domain after expiration; I could not re-register it with another provider. I think individual or company decisions depend upon 1) attitude 2) competition and 3) priority.

Apart from the 3 obvious factors mentioned above, there is one more driving force which is subtle and goes unnoticed. There are a lot of individuals who contribute to open source, help others selflessly or build communities around an idea. These people are not "saints" and have their own motivations. But apparently they have zero or very low expectations. Such "volunteers" force is becoming apparent now and forcing the big monolithic giants to either change or fail.

Current Mood: worried
